Analytics has always been the sexy bit of data management. That’s where the nuggets of insight are teased to the surface and millions are made by understanding why diapers sell beer, who is newly pregnant, or how to route a jet so it burns 25% less fuel. But behind that, there has always been the grunt work of extracting data from multiple, disparate sources, cleansing it of partial or bogus records, transforming it into a consistent and usable format, and loading it into the target analytics engine.
RichRelevance Inc. faces one of the prototypical big data challenges: lots of data, and not a lot of time to analyze it. For example, the marketing analytics services provider runs an online recommendation engine for Target, Sears, Neiman Marcus, Kohl’s and other retailers. Its predictive models, running on a Hadoop cluster, must be able to deliver product recommendations to shoppers in 40 to 60 milliseconds — not a simple task for a company that has two petabytes of customer and product data in its systems, a total that grows as retailers update and expand their online product catalogs.
Twenty-one years ago, a year before the first web browser appeared, Walmart’s Teradata data warehouse exceeded a terabyte of data and kicked off a revolution in supply-chain analytics. Today Hadoop is doing the same for demand-chain analytics. The question is, will we just add more zeros to our storage capacity this time or will we learn from our data warehouse infrastructure mistakes? These mistakes include:
- data silos,
- organizational silos, and
- confusing velocity with response time.