“Information is not knowledge” – Albert Einstein
I recently read a couple of posts about Big Data from my friend Chris Hoff: “Infosec Fail: The Problem With BigData is Little Data” and “More on Security and BigData…Where Data Analytics and Security Collide”.
In these posts Hoff posits that the mass centralization of information will benefit the industry and that monitoring tools will experience a boon, especially those that leverage a cloud-computing architecture…
This will bring about a resurgence of DLP and monitoring tools using a variety of deployment methodologies via virtualization and cloud, which were at first seen as a hindrance but will now be an incredible boon.
As Big Data and the databases/datastores it lives in interact with the proliferation of PaaS and SaaS offerings, we have an opportunity to explore better ways of dealing with these problems — this is the benefit of mass centralization of information.
Hoff then goes on to describe how new data warehousing and analytics technologies, such as Hadoop, would positively impact the industry…
Even when we do start to be able to integrate and correlate event, configuration, vulnerability or logging data, it’s very IT-centric. It’s very INFRASTRUCTURE-centric. It doesn’t really include much value about the actual information in use/transit or the implication of how it’s being consumed or related to.
This is where using Big Data and collective pools of sourced “puddles” as part of a larger data “lake” and then mining it using toolsets such as Hadoop come into play…
There is nothing inherently wrong with these sentiments, and there is no question that we are experiencing some exciting changes in the BI, data, and analytics markets. That is not surprising considering that the market is about $9b and grew almost 13% in 2010, according to IDC. But we have to recognize that these tools have a lot of limitations, and if we don’t want to experience a 70-80% failure rate, which is what business intelligence deployments are tracking according to Gartner, then we need to set proper expectations.
Hadoop is a wonderful distributed computing platform that can act as a data “lake”, “dump”, or “warehouse”; whatever you call it, you can pretty much dump anything into it without concern for mapping schemas, worrying about metadata, ensuring APIs are functioning, and all the other issues generally associated with managing data. That is awesome from a data-federation perspective, but it also makes using Hadoop to perform advanced queries challenging. You need additional tools (Hive/Pig); you need to be aware that Hadoop can be brittle and easily prone to failure when tickled too much; it isn’t easy to run queries; and if you want to democratize data so that a general business user can run analysis, you will need another solution. No doubt Hadoop (and its spawn) offers pretty radical improvements to how we manage and federate data, but it isn’t analytics, and it doesn’t offer intelligence or knowledge without a lot of work.
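To illustrate the point about raw Hadoop versus tools like Hive or Pig: even a trivial “count events by source” question becomes a map/reduce job. The sketch below follows the Hadoop Streaming convention (a mapper emits key/value pairs, a shuffle groups them by key, a reducer aggregates each group), but it is simulated locally against hypothetical log lines; the field layout and sample data are my own, not from any real deployment.

```python
# A minimal local simulation of a Hadoop Streaming-style map/reduce job.
# Field layout and sample log lines are hypothetical.
from itertools import groupby

def mapper(line):
    # Emit (source, 1) per raw log line; assumes source is the first CSV field.
    source = line.split(",", 1)[0]
    return (source, 1)

def reducer(key, values):
    # Sum the counts for one key, as a Streaming reducer would.
    return (key, sum(values))

raw_lines = [
    "firewall,deny,10.0.0.5",
    "ids,alert,10.0.0.7",
    "firewall,allow,10.0.0.9",
]

# Simulate the shuffle phase: sort mapper output by key, then group.
mapped = sorted(mapper(l) for l in raw_lines)
counts = dict(reducer(k, (v for _, v in grp))
              for k, grp in groupby(mapped, key=lambda kv: kv[0]))
print(counts)  # {'firewall': 2, 'ids': 1}
```

Hive and Pig exist precisely to spare analysts from writing this plumbing by hand, which is why a general business user needs them (or another solution) before a lake becomes queryable.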
The major issue with data “lakes” is that evolving data into intelligence and knowledge requires a good understanding of the data itself; how else would one reconcile artifact ‘A’ with variable ‘B’ and context ‘C’ generated from three separate data sources? The problem is that most people don’t understand their data and lack a priori knowledge of its metadata or structure. Dumping a bunch of data into a “lake” is a good step, but it is hardly the final step, nor is it the most challenging or beneficial one.
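A small sketch of why that a priori knowledge matters: three “puddles” can describe the same host in three different shapes, and reconciling them takes a per-source parser encoding knowledge of each source’s structure, not just a shared lake. All source names, field names, and formats below are hypothetical.

```python
# Three hypothetical records about the same host, each in a different shape.
firewall_event = "2011-09-02|10.0.0.5|deny"              # pipe-delimited log
asset_record   = {"ip": "10.0.0.5", "owner": "finance"}  # key/value store
vuln_scan      = "10.0.0.5,CVE-2011-0001,high"           # CSV export

def normalize(source, record):
    # Each branch encodes knowledge about that source's schema/metadata;
    # without it, the lake is just undifferentiated bytes.
    if source == "firewall":
        date, ip, action = record.split("|")
        return {"ip": ip, "fw_action": action}
    if source == "assets":
        return {"ip": record["ip"], "owner": record["owner"]}
    if source == "vuln":
        ip, cve, severity = record.split(",")
        return {"ip": ip, "cve": cve, "severity": severity}

# Join the three normalized views on the shared key (the IP address).
merged = {}
for src, rec in [("firewall", firewall_event),
                 ("assets", asset_record),
                 ("vuln", vuln_scan)]:
    norm = normalize(src, rec)
    merged.setdefault(norm["ip"], {}).update(norm)

print(merged["10.0.0.5"]["owner"])     # finance
print(merged["10.0.0.5"]["severity"])  # high
```

The interesting work (and the value) lives in those `normalize` branches, which is exactly the part that dumping everything into a lake does not give you.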
I believe that the industry is heading quickly into the trough of disillusionment as people realize that there is a very large chasm between federating and collecting data and turning that data into actionable intelligence and business knowledge that offers real value.
Amrit, good to see you blogging! I was wondering if you were familiar with the big data engine and tools recently open sourced by Lexis-Nexis and if you think they may tackle some of these issues. It is called HPCC and the link to it is here: http://www.lexisnexis.com/risk/about/technology.aspx. I interviewed them for Network World a while back, and it seemed like, having used this themselves for a number of years, they had developed a lot of the tools that tackle some of the issues you bring up. Check it out and let me know what you think.
Hope all else is well with you.
From what I understand, they provide a solution to the first part of the data problem: a place to centrally store, index, and query billions of records. But they don’t add any intelligence to the information.
Remember: data -> information -> knowledge -> wisdom
Collecting, federating, storing, and managing data doesn’t move one to knowledge; it makes the information step attainable. Analysis that drives critical decisions, especially decisions as sophisticated and nuanced as security, requires the second major part of the data puzzle: analysis of the data and application of knowledge to the information, or something like that 😉
Amrit, in the demo and interview they gave me, I think they have some follow-on tools they have developed that do give them this analysis. If you are interested, I can introduce you to the CTO that runs the project and would be interested in what you think.
Great post and riff off of Chris H’s post… my feeling is that we are so early that a sec model won’t be the problem… it will just be different.
As I commented already on the Securosis blog, collection and processing are only the starting point for extracting knowledge and eventually getting to wisdom. You need a way to explore the data; a way to make it actionable. Visualization is, in my humble opinion, the only way to unlock the potential! Hence my focus on the topic.