Now that we’ve wrapped up the first round of Zillow Prize, I encourage you to take some time and meet the winning teams. Zillow Prize drew more than 4,000 competitors and led to a fierce battle for the top spot. With Round 2 underway, we are taking a minute to look back at Round 1 and share some key takeaways, both from the winning solutions themselves and from what we learned about effectively extracting insights while running our first machine learning competition.
Photo Finish
It was a close contest all the way to the end: the difference between first place and 10th place was less than 0.5%, and even smaller differences separated many teams in the top 20, as the chart below shows. Interestingly, the top five or so teams had bigger separation in performance than teams ten through twenty, which suggests that superior modeling skills do rise to the top in this type of contest. Also, note the sharp drop between 11th and 12th place – it looks like the teams above that line found some signal that eluded the other four thousand or so competitors.
What we were most looking forward to at the close of Round 1 was reviewing the solutions the teams submitted. The winning team of the first round, Zensemble, built a model that improves upon the Zestimate in the greater Los Angeles area by 4.4% compared to the baseline model. As head of Data Science for the Zestimate, I had been anticipating the close of the competition for a long time. Despite the many high-profile machine learning contests that have been held over the years, on and off Kaggle, I had no idea what happens with a contest after the final leaderboard rankings are revealed. It turns out that, for the competition host, this is a really interesting time, and there is a lot to do if you want to maximize the insights you get from the competition.
Panning for Gold
It turns out that once you have the final results you still need to wait for the winning participants to document and upload their models. That waiting period is a great time to do some analytics. I started with the performance chart, but there is at least one more analysis that I’d highly recommend: after looking at the relative performance, look at the correlations between the top teams’ solutions. Often, competing teams will arrive at the same or similar solutions on their way to the top of the leaderboard. But it’s not guaranteed. A huge surprise for us here at Zillow was how uncorrelated a few of the winning solutions were.
The top two teams on the leaderboard, Zensemble and Juan Zhai, had solutions that were highly correlated, suggesting that they took similar modeling approaches. Indeed, we later found out that this was the case. But the next two teams, Silogram and Jack from Japan, had solutions that were only weakly correlated with the top two solutions and with each other! This means that among the top teams there were many different ways to improve on the Zestimate, and that we should (and did) talk to them all to get an understanding of their approaches. Combining the highly uncorrelated solutions is likely to lead to a larger improvement in the Zestimate than taking improvements from the top team’s solution alone. I’m hoping to achieve at least a 5% relative improvement, but I think more is possible when we combine all the solutions together! We also see from the correlation analysis that Alpha60 and ZhongTan are highly correlated with teams in the top four, so they probably used a similar approach, but that AI.choo might be up to something a little different. They currently sit atop the second round leaderboard, so we’ll likely get a chance to chat with them before Zillow Prize is over too. So, we asked the top four teams to document and submit their models to Zillow.
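If you’re curious what this kind of analysis looks like in practice, here is a minimal sketch of computing pairwise correlations between team submissions. The file names and the `parcelid`/`prediction` column names are hypothetical placeholders for illustration, not the actual Zillow Prize submission format.

```python
import pandas as pd

# Hypothetical submission files, one per team; each is assumed to have
# a parcel identifier column and that team's prediction for every parcel.
submissions = {
    "Zensemble": "zensemble.csv",
    "Juan Zhai": "juan_zhai.csv",
    "Silogram": "silogram.csv",
    "Jack from Japan": "jack_from_japan.csv",
}

# Load each team's predictions, indexed by parcel so the rows line up.
preds = pd.DataFrame({
    team: pd.read_csv(path, index_col="parcelid")["prediction"]
    for team, path in submissions.items()
})

# Pairwise correlation matrix of the predictions. Highly correlated
# columns suggest similar modeling approaches; weakly correlated ones
# hint at complementary signal worth combining.
print(preds.corr(method="pearson").round(3))
```

A correlation matrix like this is a quick, cheap way to decide which teams are worth a deeper conversation because they likely found different signal.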
Digging into the Solutions
Once the models were uploaded, my team and I here at Zillow set about reproducing them. While it can be tedious and takes some time, reproducing the winning solutions is a very important first step toward understanding how they work and gaining insight into the trade-off between performance and computing time. Some solutions took only a day or so to reproduce, but in other cases things took longer as we encountered issues and worked with the winning teams to sort them out. Many of the most vexing problems boiled down to configuration issues, where we needed just the right combination of operating system, R or Python version, and package versions. My understanding is that this is a pretty common experience for competition hosts and teams. For the next round of Zillow Prize, we’re requiring that teams set up their solutions inside Docker containers and provide us the Dockerfile that sets up the right configuration and environment, so that reproducing the solutions goes more smoothly and we can get to the fun part – finding out why the solutions work!
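To give a rough sense of what that looks like, here is a minimal sketch of the kind of Dockerfile a team might provide. The base image, file names, and entry point here are illustrative assumptions, not requirements we have published.

```dockerfile
# Pin the base image so the OS and Python version are reproducible.
FROM python:3.6-slim

WORKDIR /solution

# Pin exact package versions (requirements.txt is assumed to list them)
# so the environment matches the one the team trained with.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy in the solution code and run training and prediction with one command.
COPY . .
CMD ["python", "train_and_predict.py"]
```

With something like this, reproducing a solution becomes `docker build` and `docker run` rather than a back-and-forth over operating systems and package versions.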
Since the Netflix Prize, machine learning competitions have had a reputation for delivering complicated models designed to squeeze out every last bit of signal from the data, no matter the computing cost. To help competition hosts cut to the core of what makes a model effective, Kaggle asks winners to provide not only their winning model but also a simplified model that is as performant as possible. So, Zillow actually worked to reproduce and benchmark two models from each team, and that led to some fascinating insight into the performance vs. computing time trade-off. For Zillow Prize this was quite the eye-opener, as the difference in computing time between Zensemble and the next three teams was quite large for a relatively small gain in performance. The simplified solutions also proved to be very important.
The simplified models also showed us that we could get 99% of the improvements from the first round with a model that takes just over an hour to train! Not only were all the simplified models faster to train, they also helped us (and the participants) better understand where the real performance gains came from. For example, Jack found that an elaborate ensembling mechanism he was using was not necessary. In another case, team Juan Zhai even found that an ensemble member they had added at the last minute had actually hurt their performance. The simplified models also provided a lot of insight into which types of feature engineering were really effective. We will certainly be asking the top teams in the final round of Zillow Prize to produce simplified versions of their models as part of receiving the prize money.
Asking Good Questions
My favorite part of the post-competition wrap-up came when we got an hour to chat with each of the money-winning teams about their solution. Kaggle refers to these video conferences as “Winners Calls” and I’m glad we saved them for last. By the time we had each winner’s call, we had already verified that the solutions were fair and had been informally trading messages with the teams for a few weeks as we worked to reproduce their solutions. A few days before the call, each team would send us a slide deck walking through their solution. The Zestimate team would then review the deck and collect questions to ask when we got on the call. There were a few questions we asked of everyone regardless of the details of their solution. Probably the best of these was “What did you try out that didn’t work?” There were many interesting ideas that folks had tried that just didn’t pan out. One recurring theme we found with this question is that target-based encoding, in which you feed aggregates of the dependent variable (in this case Zestimate error) into your model as a feature, was only good for overfitting your model (see the sketch below). This validated some similar experiences our team at Zillow has had with this approach. We always had a lively Q&A session with the winners, and I can’t wait for the next batch of calls a year from now!
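To make that overfitting pitfall concrete, here is a minimal sketch contrasting naive target encoding with an out-of-fold version. The `zipcode` and `logerror` column names are hypothetical, and the out-of-fold variant is a common mitigation in general, not something any particular team described.

```python
import pandas as pd
from sklearn.model_selection import KFold

def naive_target_encode(df, cat_col, target_col):
    """Encode each category with the mean target computed on the SAME rows.
    The feature leaks the target into the training data, so models
    happily overfit to it and validation scores suffer."""
    means = df.groupby(cat_col)[target_col].mean()
    return df[cat_col].map(means)

def oof_target_encode(df, cat_col, target_col, n_splits=5, seed=0):
    """Out-of-fold encoding: each row only sees target means computed
    from the other folds, which removes most of the leakage."""
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        encoded.iloc[val_idx] = df.iloc[val_idx][cat_col].map(fold_means).values
    # Categories unseen in a training fold fall back to the global mean.
    return encoded.fillna(df[target_col].mean())

# Hypothetical usage: encode a neighborhood-level categorical feature
# against the Zestimate log error.
# df["zip_te"] = oof_target_encode(df, "zipcode", "logerror")
```

Even with out-of-fold means, this kind of feature can still be fragile on data where the target distribution shifts, which is consistent with what both the winning teams and our own team observed.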