“Information is not knowledge” – Albert Einstein
I recently read a couple of posts about BigData from my friend Chris Hoff: “Infosec Fail: The Problem With BigData is Little Data” and “More on Security and BigData…Where Data Analytics and Security Collide”.
In these posts Hoff posits that the mass centralization of information will benefit the industry and that monitoring tools will experience a boon, especially those that leverage a cloud-computing architecture…
This will bring about a resurgence of DLP and monitoring tools using a variety of deployment methodologies via virtualization and cloud that was at first seen as a hinderance but will now be an incredible boon.
As Big Data and the databases/datastores it lives in interact with then proliferation of PaaS and SaaS offers, we have an opportunity to explore better ways of dealing with these problems — this is the benefit of mass centralization of information.
Hoff then goes on to describe how new data warehousing and analytics technologies, such as Hadoop, would positively impact the industry…
Even when we do start to be able to integrate and correlate event, configuration, vulnerability or logging data, it’s very IT-centric. It’s very INFRASTRUCTURE-centric. It doesn’t really include much value about the actual information in use/transit or the implication of how it’s being consumed or related to.
This is where using Big Data and collective pools of sourced “puddles” as part of a larger data “lake” and then mining it using toolsets such as Hadoop come into play…
There is nothing inherently wrong with these sentiments, and there is no question that we are experiencing some exciting changes in the BI, data and analytics markets. That is not surprising considering that the market is about $9 billion and grew almost 13% in 2010, according to IDC. But we have to recognize that these tools have plenty of limitations, and if we don’t want to experience the 70-80% failure rate that Gartner reports for business intelligence deployments, we need to set proper expectations.
Hadoop is a wonderful distributed computing platform that can act as a data “lake”, “dump” or “warehouse”. Whatever you call it, you can pretty much dump anything into it without mapping schemas, worrying about metadata, ensuring APIs are functioning, or dealing with all the other issues generally associated with managing data. That is awesome from a federation-of-data perspective, but it also makes using Hadoop to perform advanced queries challenging. You need additional tools (Hive/Pig), you need to be aware that Hadoop can be brittle and prone to failure when tickled too much, it isn’t easy to run queries, and if you want to democratize data so that a general business user can run analysis, you will need another solution. No doubt Hadoop (and its spawn) offers pretty radical improvements to how we manage and federate data, but it isn’t analytics and it doesn’t offer intelligence or knowledge without a lot of work.
The major issue with data “lakes” is that for data to evolve into intelligence and knowledge, you need a good understanding of the data itself. How else would one reconcile artifact ‘A’ with variable ‘B’ and context ‘C’ generated from three separate data sources? The problem is that most people don’t understand their data and lack a priori knowledge of its metadata or structure. Dumping a bunch of data into a “lake” is a good step, but it is nowhere near the final step, nor is it the most challenging or beneficial.
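To make the reconciliation problem concrete, here is a minimal Python sketch. Everything in it is hypothetical: the three sources, the field names (`host`, `src`, `hostname`) and the `normalize_host` and `correlate` helpers are invented for illustration. The point is only that the join works because someone already knew which field in each source identifies the same machine, and how each source formats it.

```python
def normalize_host(name):
    """Lower-case a host name and drop any domain suffix.

    Assumes every source identifies machines by host name, which is
    exactly the kind of a priori knowledge a raw data "lake" lacks.
    """
    return name.strip().lower().split(".")[0]


def correlate(scan_rows, log_rows, asset_rows):
    """Join scanner findings with log events and asset ownership."""
    # Index the asset inventory by normalized host name.
    assets = {normalize_host(row["hostname"]): row for row in asset_rows}

    # Group log events by normalized source host.
    events = {}
    for row in log_rows:
        events.setdefault(normalize_host(row["src"]), []).append(row["event"])

    # Attach events and owner to each finding, when a match exists.
    combined = []
    for row in scan_rows:
        key = normalize_host(row["host"])
        combined.append({
            "host": key,
            "finding": row["finding"],
            "events": events.get(key, []),
            "owner": assets.get(key, {}).get("owner"),
        })
    return combined


# Three hypothetical sources, each naming the same machine differently.
scan_rows = [{"host": "WEB-01", "finding": "weak-tls"}]
log_rows = [{"src": "web-01.example.com", "event": "login-failure"}]
asset_rows = [{"hostname": "web-01", "owner": "payments"}]

result = correlate(scan_rows, log_rows, asset_rows)
print(result)
```

Without the normalization step, the three spellings of the same host would never match, and the “lake” would simply hold three unrelated piles of records.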
I believe that the industry is heading quickly into the trough of disillusionment as people realize that there is a very large chasm between federating and collecting data and turning that data into actionable intelligence and business knowledge that offers real value.