Adding crowdsourcing and gamification to Big Data? That’s the combination that is at the heart of start-up Kaggle, a platform for predictive modelling and analytics competitions. The idea is quite simple: companies and researchers post their data and statisticians and data miners from all over the world compete to produce the best models. So far organizations such as NASA, Wikipedia, Deloitte and Allstate have been using Kaggle and its competitions. By far the most lucrative prize on Kaggle is a $3 million reward offered by Heritage Provider Network to the person who can most accurately forecast which patients will be admitted to a hospital within the next year by looking at their past insurance claims data. More than 1,000 people have downloaded the anonymized data that covers four years of hospital visits and they have until April 2013 to post answers.
This crowdsourcing approach is especially interesting for companies that are experimenting with big data or are eager to find out what the data they already own can tell them. Kaggle offers a community of data scientists from quantitative fields such as computer science, statistics, econometrics, maths and physics to crunch the numbers for you. There are three ways companies can use Kaggle to put their data to work:
Identify is all about posting a sample of your dataset to the Kaggle community and letting members explore the data, post comments and conduct analyses. Winning ideas are chosen from a pool of the highest-voted proposals by a panel of judges made up of data scientists from the host organization and the Kaggle data science team.
In Analyze mode it all revolves around your data and a specific question. Set out the data mission (publicly, or privately with selected contenders) and a prize, and let the community come up with answers.
Finally, in Implement mode the Kaggle engine enables you to implement the winning model(s) from your competition and integrate them into existing systems.
208% prediction improvement
Dunnhumby, a U.K. firm that does analytics for supermarket chains, was looking to build a model to predict when supermarket shoppers will next visit the store and how much they will spend. Players in this data competition (see also the competition page) were given a data set that included details of every visit made by 100,000 customers over a year. Customers were identified only by a number, and each visit was recorded as the amount spent on a given date. Based on this one year’s worth of purchasing data, players had to predict when each of the 100,000 customers would next visit the store, and how much they would spend on that visit.
Around 2,000 entries were submitted to the $10,000 prize competition over the course of two months. The winning entry came from D’yakonov Alexander, a 32-year-old associate professor of mathematics at Moscow State University, whose method gave more weight to recent visits when predicting the next one. It was 208% more accurate than the existing benchmark.
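To make the idea of giving more weight to recent visits concrete, here is a minimal Python sketch. It is not the winning model, whose details are not described here. The function name `predict_next_visit`, the `decay` parameter and the sample dates are all assumptions made up for illustration. The sketch takes the gaps between a customer's visits, weights the most recent gap most heavily, and adds the weighted average gap to the date of the last visit.

```python
from datetime import date, timedelta


def predict_next_visit(visit_dates, decay=0.5):
    """Guess the next visit date from a recency-weighted average gap.

    Hypothetical helper for illustration only, not the competition winner's method.
    `decay` (assumed, between 0 and 1) shrinks the weight of each older gap.
    """
    if len(visit_dates) < 2:
        raise ValueError("need at least two visits to measure a gap")

    dates = sorted(visit_dates)
    # Number of days between each pair of consecutive visits, oldest first.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]

    # The newest gap gets weight 1, the one before it `decay`, then decay**2, ...
    weights = [decay ** (len(gaps) - 1 - i) for i in range(len(gaps))]
    weighted_gap = sum(w * g for w, g in zip(weights, gaps)) / sum(weights)

    return dates[-1] + timedelta(days=round(weighted_gap))


# Made-up visit history for one customer whose visits are getting more frequent.
visits = [date(2011, 1, 1), date(2011, 1, 15), date(2011, 1, 22), date(2011, 1, 26)]

print(predict_next_visit(visits))             # recent gaps count more: 2011-02-01
print(predict_next_visit(visits, decay=1.0))  # plain average of gaps:  2011-02-03
```

With `decay=1.0` every gap counts equally, so the comparison shows how the weighting pulls the prediction toward the customer's most recent behaviour.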
Kaggle offers a way to gain insight into the data you already own and into how to put that data to work in the future. A lot of businesses struggle with a lack of expertise and experience when it comes to big data; with Kaggle, data experts are within close reach. Also, the most talented people often work for the biggest companies, so Kaggle is especially interesting for smaller companies that do not have an in-house data scientist. Kaggle offers a community of experts that you can tap into without having to hire anyone. And if you decide to hire a data scientist, take a look at the Kaggle leaderboard. The Top 10 should have some interesting candidates for the position.