Although analyzing “big data” has the power to transform your business, the ease of doing so has been overstated. In reality, harnessing big data is still a messy and labor-intensive business. As an analytics professional, I am incredibly excited by what we can do with data, but I think some of the hype is doing us a disservice, because it creates a false expectation of how easy this work is going to be. Most things in life that are important and worthwhile are difficult, and the analysis of Big Data is no different. The solution is to take small steps: get started now by analyzing data with very specific objectives, accept that this is still very much a manual model-building process, and build a staircase of successive small projects that grow steadily over time into a transformative program. To begin with, don’t believe these commonly heard myths…
Big Data Myth #1: It’s Big
Big data isn’t big. And not only is “Big Data” poor English, but it’s also misleading. What we’re talking about is a large volume of data points, updated at high frequency, with a short lag to the actual event (real time or near real time). It’s very granular. It’s individual transaction data: a certain credit card, paying for a certain amount of gas, at a certain gas station. Big Data is actually lots and lots of very small data. It’s not a landslide of data, it’s a sandstorm. And sandstorms can blind and disorientate you. The Bedouin said a sandstorm could drive a man mad in 6 minutes. So, to help see in the storm, what other myths do we need to debunk?
Big Data Myth #2: Big Data analytics is an automated process
Well, first up is the notion of “real-time” analytics. Decision rules can be applied in real time; you add a digital camera to your Amazon cart, and the site asks you if you want to buy a memory card. However, projects to create these rules are still very much “projects.” They have a beginning, a middle and an end. This is, in reality, quite a manual process. Even in-vogue, Netflix-style collaborative filters are layered with manually built and ever-changing rule engines. For the moment, at least, Big Data analytics is more often than not still a sausage-making project, and even more so with unstructured data like text, pictures and videos.
Big Data Myth #3: The more granular the data, the better
Is real-time and granular data always better? No, it’s not. You’ll miss the forest for the trees. The first quarter of a football game doesn’t predict how a whole game plays out (remember the 49ers’ near-comeback). Real-time can be too close to the action. Sometimes, you need to pull back for the long shot to reveal what’s really going on.
Big data is encumbered by a huge amount of white noise. The noise as a proportion of the total signal increases with higher resolution: for example, data by minute rather than by week, or data at a town level rather than a state level. Do not confuse precision with accuracy. Big Data, in its raw, disaggregated form, can be misleading. There needs to be an appropriate level of aggregation for all the white noise to cancel out. So, all those grains of sand need to be aggregated appropriately before you can make any sense of them.
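To make the point about aggregation concrete, here is a minimal sketch in Python. Everything in it is an assumption chosen for illustration rather than real data: a flat true level of 100 units per minute, independent random noise with a standard deviation of 30, and the helper names `noise_to_signal` and `aggregate`. It simulates one week of minute-level readings and compares the noise relative to the signal at minute, hour and day resolution.

```python
import random
import statistics


def noise_to_signal(values):
    """Spread relative to the average level (coefficient of variation)."""
    return statistics.pstdev(values) / statistics.mean(values)


def aggregate(values, window):
    """Sum consecutive, non-overlapping blocks of `window` data points."""
    return [
        sum(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


# Hypothetical data: a constant true level of 100 units per minute,
# plus independent Gaussian noise (standard deviation 30) on every reading.
rng = random.Random(42)
minutes_per_week = 7 * 24 * 60
per_minute = [100 + rng.gauss(0, 30) for _ in range(minutes_per_week)]

per_hour = aggregate(per_minute, 60)        # 168 hourly totals
per_day = aggregate(per_minute, 24 * 60)    # 7 daily totals

for label, series in [("minute", per_minute), ("hour", per_hour), ("day", per_day)]:
    print(f"{label:>6}: noise relative to signal = {noise_to_signal(series):.4f}")
```

In this simulation the ratio shrinks as the data is rolled up, because independent errors partly cancel when they are summed. Two caveats apply. Real noise is often correlated, so it may cancel more slowly than it does here. And real signals are not flat, so aggregating too far can also smooth away the genuine movement you were looking for, which is why the level of aggregation has to be chosen deliberately.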
Big Data Myth #4: Big Data is good data
There is a distinction between a lot of data and a lot of good data. As one of the pioneering generation of marketing analysts, I spent countless hours lining up data so the rows matched in nice daily or weekly chunks, chasing missing data and trying to address anomalies. Poor-quality data has lots of errors and lots of missing values, and it can be misleading. Photographs and videos can be tagged incorrectly, and is unstructured text written by teenagers reflecting a positive or negative sentiment? It takes a smart model to figure that out. Sometimes, to make sense of data, you need to throw some of it away. To analyze Big Data, one of the first things you have to figure out is what data to include in your analysis, and what you need to throw away. Bad data can lead you off the right track. It can cost you countless weeks or months of imputation, redefinition and realignment. Identifying and focusing on the most useful data can get you ahead.