Big Data: Principles and Examples Vol. 1
Big Data has become the subject of Big Hype, much as Social Media and Mobile were recently. Our goal today is to peel back the hype and discover some of the key principles behind Big Data so we can make the best possible decisions about when, where, and how to apply it.
My background with Big Data has predominantly been in retail, as Principal Engineer in Personalization at Amazon and now as Chief Scientist at RichRelevance, so I will use several retail examples. However, the principles behind these examples are without question more broadly applicable. These principles are:
- Before we look at any data, we have to have a clear and well-defined goal. Otherwise we are likely to find very clever solutions to the wrong problems.
- Smart data science requires the same fundamental scientific method—hypothesis, experimentation, and analysis—as every other science.
- Correlation is not causation. We all know this, but in a big data world it is much easier to confuse the two.
- Data are economic assets. Understanding them as such helps us understand how to motivate all participants in the data economy, from individuals to corporations to governments and non-profits.
The Netflix Prize
The Netflix Prize has done more to bring Big Data and data science in general to the public mind than any other event. This has been great for increasing the visibility of the field but, I’m sad to say, miserable for actual practice. The saddest part is that the winning algorithms are not in use at Netflix today, and are unlikely ever to be.
What went wrong? Fundamentally, the contest violated Principle 1. It did not ask contestants to optimize the right thing, which is the choice of films to recommend to customers. Instead, the contest judged algorithms by how well they predicted the ratings customers would give to movies. So far that doesn’t sound completely illogical. If you know what I will rate highly, you can recommend it to me.
Unfortunately, the algorithms were judged across the entire ratings scale. So being able to tell the difference between something to which I would give one star and something to which I would give two stars was just as important as being able to tell the difference between something to which I would give four stars and something to which I would give five.
Why does this matter? Well, chances are I would never recommend either the one-star or the two-star film. Does it really matter that I can tell the difference between films you despise and films you merely don’t like? It almost certainly does not. And it certainly does not matter nearly as much as knowing the difference between films you will love and those you will merely like.
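A small sketch can make the mismatch concrete. The Netflix Prize scored entries by root-mean-squared error (RMSE) over held-out ratings, which weights a mistake at the bottom of the scale exactly like a mistake at the top. Everything else below is made up for illustration: the six ratings, the two sets of predictions, and the `top_k_hits` helper are assumptions, not contest data and not a metric anyone is known to have used in production.

```python
import math


def rmse(predicted, actual):
    """Root-mean-squared error over all (prediction, rating) pairs."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def top_k_hits(predicted, actual, k=2, loved=5):
    """Hypothetical helper: of the k films with the highest predicted
    rating, count how many the customer would actually rate `loved` stars."""
    ranked = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    return sum(1 for i in ranked[:k] if actual[i] >= loved)


# Made-up star ratings one customer would give to six films.
actual = [1, 2, 4, 4, 5, 5]

# Model A: perfect at the low end, but ranks the four-star films
# slightly above the five-star films.
model_a = [1.0, 2.0, 4.6, 4.6, 4.4, 4.4]

# Model B: swaps the one-star and two-star films, but gets the
# top of the scale exactly right.
model_b = [2.0, 1.0, 4.0, 4.0, 5.0, 5.0]

for name, predicted in [("A", model_a), ("B", model_b)]:
    print(name, round(rmse(predicted, actual), 3), top_k_hits(predicted, actual))
```

With these made-up numbers, Model A has the lower RMSE and would score better in an RMSE contest, yet neither of its top two picks is a film the customer would rate five stars. Model B loses on RMSE only because it confuses two films nobody would recommend, and both of its top two picks are five-star films.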
So what went wrong? Principle 1 was violated, the data scientists were unleashed, and we got great solutions to the wrong problem. Netflix got a lot of press, the winners got some cash, but the solution never went into production.