Needles in a Digital Hay Stack; Finding Value in Big Data

Big data is a scorching hot topic, currently capturing a lion's share of the market's available stock of hyperbole, and for good reason: data is growing at a meteoric rate.

As we continue to innovate, as business accelerates technology adoption, as the line blurs between corporate and personal computing, and as we interact more in digital mediums, we are creating mountains of data. Much of this data is garbage, but some of it is gold (big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold).

Unfortunately, as with all overly hyped technologies, there is a lot of misinformation, failed expectations, and the inevitable trough of disillusionment. But that doesn’t mean you have to spend months or years curled up in a fetal position, disillusioned and wondering what went so wrong. With a thoughtful approach you can venture through the murky swamp of your big data and find the insights that give your company a significant competitive and market advantage.

The current focus of big data has generally been on the management and federation of large amounts of data (volume), which may be structured, semi-structured, or unstructured (variety), and which traverses corporate applications and infrastructure at a phenomenal rate (velocity). For example, companies with a large population of online users can generate petabytes of clickstream data in very short time frames (daily, weekly, monthly). Much of this data is incredibly difficult to manage and analyze with traditional technologies (relational DBs, column-store DBs, OLAP cubes, etc.), but there can be extremely valuable information contained in the patterns and behaviors identified within the data.
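To make "patterns and behaviors" a little more concrete, here is a minimal sketch of one such pattern: the most common page-to-page transitions in a clickstream. The event layout (user id, timestamp, page) and the sample data are assumptions made up for illustration, and at real big data volumes this logic would run on a distributed platform rather than in a single Python process.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical clickstream events: (user_id, timestamp, page).
events = [
    ("u1", 1, "home"), ("u1", 2, "search"), ("u1", 3, "product"),
    ("u2", 1, "home"), ("u2", 2, "search"), ("u2", 3, "cart"),
    ("u3", 1, "home"), ("u3", 2, "product"),
]

def transition_counts(events):
    """Count page-to-page transitions within each user's click sequence."""
    counts = Counter()
    # Sort by user, then by timestamp, so each user's clicks are contiguous
    # and in order before grouping.
    ordered = sorted(events, key=itemgetter(0, 1))
    for _, session in groupby(ordered, key=itemgetter(0)):
        pages = [page for _, _, page in session]
        # Pair each page with the one visited immediately after it.
        counts.update(zip(pages, pages[1:]))
    return counts

# The single most frequent transition across all users.
top = transition_counts(events).most_common(1)
print(top)
```

Even a toy like this shows where the value sits: not in storing the clicks, but in asking a question of them.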

This challenge has caused the market to focus on the first-order problem: how do we effectively collect, federate, and manage the massive amount and variety of data that is being generated by our business?

As a result, we have seen some great technology, both commercial and open-source, and innovative approaches appear (massively parallel processing, NoSQL/NewSQL, schema-less databases, and pattern-store indexing) to address this problem. But not enough attention has been given to the real reason a business needs to collect the data, which is to answer the question: how do we derive business value and insights from the data that can provide a competitive or market advantage?

I discuss this in an earlier post (bigdata-hadoop-and-the-impending-informationpocalypse)…

Hadoop is a wonderful distributed computing platform that can act as a data “lake”, “dump”, “warehouse”, or “swamp”. Whatever you call it, you can pretty much dump anything in it without concern for mapping schemas, worrying about metadata, ensuring APIs are functioning, and all the other issues generally associated with managing data. That is awesome from a federation-of-data perspective, but it also makes using Hadoop to perform advanced queries challenging. You need additional tools (for example, Hive or Pig); you need to be aware that Hadoop can be brittle and prone to failure when tickled too much; it isn’t easy to run queries; and if you want to democratize data so that a general business user can run analysis, you will need another solution. No doubt Hadoop (and its spawn) offers pretty radical improvements to how we manage and federate data, but it isn’t analytics, and it doesn’t offer intelligence or knowledge without a lot of work.

Btw, I reject the term data lake and instead strongly suggest adopting the term data swamp, for no other reason than that there are a lot more witty observations one can make about swamps than lakes. So work with me, people. But I digress…

The market for big data is evolving – it must. Organizations were initially looking for big data management solutions that enable them to deal with the massive volumes and variety of data, but the real value is derived from finding insights and patterns of behavior in the data. That requires organizations to look for big data analytics tools that empower them to turn their data into knowledge.

The idea that managing data automatically delivers the benefits of analyzing data seems to be the biggest misconception today. It is the number one reason organizations need to understand what problem they are trying to solve before they spend too much time proclaiming that the data management project will somehow magically produce golden nuggets of knowledge, because it probably won’t.

You can’t just pounce on your data like a jungle cat attacking prey. You need to take it slow: ease into the data, take it out to dinner, massage its rough edges, help it lose a couple billion pounds, but most importantly spend some time really getting to know your data. Then, with any luck, you can proudly wear your ‘Quant Star’ t-shirt around the office, knowing that you and your team’s hard work and insights just made the company some serious coin.

If nothing else, remember that there is a very large chasm between federating and collecting data and turning that data into actionable intelligence and business knowledge that offers real value.

2 thoughts on “Needles in a Digital Hay Stack; Finding Value in Big Data”

Securitygearguy | September 19, 2011 at 7:52 pm

I wish simplicity was an answer to things. The flow of info is making everything worse. Technology sometimes makes things “appear” easy, but companies are making it more complicated to access things, so the quality of things is clouded with garbage.

Philip Favro | September 28, 2011 at 10:28 pm

Great post. The fact of the matter is that not all information is equal – some is critical, the rest is not. The majority of this information explosion is unstructured data. I work for Symantec and we conducted a survey that reveals companies tend to archive irrelevant data. The report indicates that 75 percent of backups are on legal hold or have infinite retention and yet the same customers estimated that 40 percent of information on legal hold is not relevant to litigation. If you’re interested, check out the report at http://bit.ly/oJRh0p.