The Inspiration Behind RecLab: Don't Bring the Data to the Code, Bring the Code to the Data

Last fall I had the pleasure of visiting Barcelona, one of my favorite cities on earth. It normally wouldn’t take much beyond the great Catalan food and wine to entice me, but in this case there was an even more compelling reason to visit: the ACM Recommender Systems conference, and the chance to dig into deep technical conversations with some of the leading scientists and engineers in the field. I also had a chance to demonstrate Instant Shopper, a neat little demo we put together to illustrate the speed and effectiveness of RichRelevance algorithms that combine search and behavioral data on some of our merchant partners’ sites.

Behind the various algorithms and approaches presented at the conference, a familiar theme surfaced, just as it does every year. Researchers, particularly those in academia, are starved for access to real large-scale data they can use to evaluate the effectiveness of recommendation algorithms. Companies like RichRelevance or Amazon that have petabytes of real live shopping data simply can’t share it with the research community without violating the trust and privacy of their shoppers.

Even with anonymized IDs that don’t reveal individuals’ names, it is simply too easy to reconstruct the identities of real people. This is exactly what happened in 2006 when AOL released anonymized search data. Two weeks later, their CTO was forced to resign. Similarly, Netflix cancelled the second installment of their contest over fears that the data could not be made sufficiently anonymous.

I began a series of hallway and coffee break discussions with other conference attendees to see if there was anything RichRelevance could do to improve the situation. After all, if we could provide data to the research community, we could benefit from the algorithmic improvements they produced.

Beyond data, it was also clear to me that we needed to study interaction. That is, researchers don’t need just offline data, but also the ability to put their recommendations in front of real customers and see how they would react.

In considering the need for interaction, the solution was suddenly clear. We had been stuck on the idea of getting data to researchers and their code, but the real solution was to bring their code to our data, and enable it to interact with real live shoppers. From this a-ha moment, RecLab was born. Over subsequent months we designed an API and a set of tools to enable researchers to build recommenders that could ultimately run in the RichRelevance cloud. We also created documentation, testing infrastructure, and sample synthetic data, all under an open-source license. Researchers who use these APIs can submit their algorithms to run in our cloud. Data on our individual shoppers never leaves the cloud, but insights about algorithm performance do.
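To make the code-to-data idea concrete, here is a minimal sketch of what such a contract could look like. The type and method names below (ShopperEvent, Recommender, MetricsReporter) are illustrative assumptions, not the actual RecLab API: the researcher’s algorithm implements a small interface, the platform feeds it shopper events inside the cloud, and only aggregate performance numbers ever leave.

```java
// Hypothetical sketch of the "bring the code to the data" pattern.
// The names below are illustrative assumptions, not the actual RecLab API.
import java.util.List;
import java.util.Map;

/** A shopper event the platform streams to the recommender inside its own cloud. */
record ShopperEvent(String anonymousSessionId, String productId, String action) {}

/** Contract a researcher's algorithm implements; it runs next to the data and never exports it. */
interface Recommender {
    /** Update internal state from an observed event (view, click, purchase, ...). */
    void observe(ShopperEvent event);

    /** Return product IDs to recommend for the given session. */
    List<String> recommend(String anonymousSessionId, int howMany);
}

/** Only aggregate performance metrics leave the cloud, never raw shopper data. */
interface MetricsReporter {
    void report(Map<String, Double> aggregateMetrics); // e.g. click-through rate, conversion lift
}
```

The design point is the direction of movement: the algorithm is shipped to where the shoppers and their data already are, and what flows back out is a summary of how well it performed.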

All of this infrastructure will be formally introduced at the CaRR2011 workshop next month, but the system is already available in beta form for any and all researchers who want an early look.

I’m thrilled to be able to make this system available to the community and I can’t wait to see what kind of algorithms and insights they come up with. It was definitely the best thing that came out of my trip to Barcelona—indeed it far exceeded even the incredible jamon Iberico, foie-de-mer and caviar tapas I had, which for me is saying an awful lot.
