What’s in a Number?
In the case of the Zestimate, it’s a lot. Lloyd’s blog post Thursday announced that we officially came out of beta in concert with expanding our database and improving our accuracy. I wanted to offer a little more perspective on what goes into your Zestimate now.
When we first launched in 2006, we set out to do something no one else was doing – putting a value on millions of homes in America, for free. But not just any value. We wanted a value that we could update often (so consumers would always have a fresh valuation) and one that we could compute historically for all homes (so that consumers would have a sense for how prices changed over time). So, we spent some time looking at how others had approached this problem in the past, decided that none of these approaches would suffice for the accuracy, speed and scope of the problem we were trying to solve, and set about inventing a new way to do daily home valuations on millions of homes.
To get started, we borrowed heavily from the fields of machine learning, artificial intelligence, automated knowledge discovery and statistics, and from applications of these fields to the areas of genomic research, pattern detection for national security problems, computational biology, and automated equity trading. The result was an algorithm that was able to take records of recent sales and incorporate billions of unique home details (which varied in how these were defined and recorded by each county) in order to compute current Zestimate valuations with a pretty reasonable rate of accuracy. And do this all every night, for 40 million homes.
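Zillow hasn’t published the algorithm itself, so as a purely illustrative sketch of the general idea of combining recent sales with home details, here is a toy comparable-sales valuation. Every field name, weight, and number below is made up for the example; this is not Zillow’s actual model.

```python
# Toy illustration (NOT Zillow's actual model): value a home by weighting
# recent comparable sales according to how similar each comp is.
# All attributes and prices here are invented for the sketch.

def similarity(home, comp):
    """Crude similarity score: penalize differences in key attributes."""
    diff = (
        abs(home["sqft"] - comp["sqft"]) / 1000.0
        + abs(home["beds"] - comp["beds"])
        + abs(home["baths"] - comp["baths"])
    )
    return 1.0 / (1.0 + diff)

def estimate_value(home, recent_sales):
    """Similarity-weighted average of recent sale prices."""
    weights = [similarity(home, s) for s in recent_sales]
    total = sum(weights)
    return sum(w * s["price"] for w, s in zip(weights, recent_sales)) / total

sales = [
    {"sqft": 1800, "beds": 3, "baths": 2, "price": 450_000},
    {"sqft": 2400, "beds": 4, "baths": 3, "price": 610_000},
    {"sqft": 1500, "beds": 2, "baths": 1, "price": 380_000},
]
subject = {"sqft": 1900, "beds": 3, "baths": 2}
print(round(estimate_value(subject, sales)))
```

A real automated valuation model would of course use far more features, local market adjustments, and trained statistical models rather than a hand-written weighting, but the shape of the input, recent sales plus per-home facts, is the same.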
After we launched, we got the benefit of real-time input as we were constantly adding new information and data sources along with sales transactions. This helped expand our database and identify issues within markets. As we learned more about market variances, we knew we needed a more sophisticated algorithm that could digest the amount of intelligence we were gathering. In addition, we realized that some of our data sources were more limited than others and we wanted to offer owners and real estate agents the ability to update home facts since they have the most intimate knowledge of the property. We launched the ability to edit home facts in December of 2006 and soon after began constructing our new algorithm.
In the early stages of this construction task, our team of statisticians and analysts spent extensive time looking at where the initial algorithm was challenged. For example, in California, our models didn’t always fully counteract the effects of Proposition 13 on tax assessments. In New York City, many homes don’t have the number of bedrooms and bathrooms listed in public data. In Chicago, often only the number of bathrooms and the square footage are available.
After the statisticians sketched out a prototype of a new algorithm designed to remedy or work around the many market variances, our talented software engineers contributed their own ideas of how to improve the algorithm, worked tirelessly to implement the whole thing, developed solutions to fill gaps that emerged during implementation and, in general, made the final product substantially better than where it started.
Concurrent with the effort on the valuation front, our data team was furiously working on bringing new data sources online, with the aim of getting millions more homes into our database and of enabling Zestimate valuations on homes already in our database that we had previously been unable to value because of a lack of sales data.
The result of these combined efforts over the last year is what you find today in your Zestimate, which, depending on where you live, is up to 30% more accurate than before. How’d we do this? Our new algorithm has 20 times more statistical models than our original one, running approximately 334,000 models every day. For our release last week, this involved calculating Zestimates over the last 12 years on our entire database of 80 million homes – all told, we churned 4 terabytes of data using 67 million statistical models to calculate 13 billion Zestimates. That’s a whole lot of Zestimates. So many, in fact, that we turned to our friends up the hill here in Seattle to use 500 computing nodes at Amazon Web Services in order to compute all of the historical Zestimates. And, thanks to the efforts of the small cadre of people who bring new data into Zillow, we’ve got these better Zestimates on 14 million more homes than before, plus data on another 10 million homes.
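Computing a historical backfill like this is an embarrassingly parallel job: each home’s valuations can be computed independently, so the work can be split into batches across compute nodes. The sketch below shows one simple way to partition such a job; the node count, home IDs, and round-robin scheme are illustrative assumptions, not a description of Zillow’s actual pipeline.

```python
# Illustrative only: splitting a large, independent-per-home valuation job
# into per-node work lists, as an embarrassingly parallel backfill might be.
# Node count and home IDs are invented; this is not Zillow's infrastructure.

def partition(home_ids, num_nodes):
    """Round-robin home IDs into one work list per compute node."""
    batches = [[] for _ in range(num_nodes)]
    for i, home_id in enumerate(home_ids):
        batches[i % num_nodes].append(home_id)
    return batches

homes = list(range(1, 21))      # pretend these are 20 home IDs
batches = partition(homes, 5)   # e.g. 5 worker nodes instead of 500
for node, work in enumerate(batches):
    print(f"node {node}: {work}")
```

Because no batch depends on any other, adding nodes shortens the wall-clock time roughly linearly, which is what makes renting a large fleet of cloud nodes for a one-time backfill attractive.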
The significant increase in models has allowed us to get much more granular and take into account more of those market variances when calculating a Zestimate – including looking at data at a more local level. As Lloyd pointed out, we are also now factoring in edited home facts, and in our current Zestimates, we’ve included 16 million new data points provided by users, which have contributed to the accuracy gain. What’s more, the new Zillow algorithm is self-learning, so it becomes smarter as we incorporate more data. The result is a more robust and accurate database that, in turn, helps consumers become smarter about real estate.
All of this is not to say that we don’t still get it wrong sometimes. We certainly do, and sometimes in a big way. It is to say, however, that today we’ve taken a big step forward in accuracy, and we’re only going to be taking more, smaller, quicker steps in the future. This couldn’t have been done without the hard work and unrelenting drive for perfection by the team at Zillow. There were a lot of bloodshot, tired eyes in getting this live for the world to see…and it was well worth it. And, there’s more to come.
Dr. Stan Humphries is a real estate economist and real estate expert for Zillow. Stan is in charge of the data and analytics team at Zillow, which develops housing market data for most major metropolitan statistical areas in the U.S., and provides economic research for current real estate market conditions. He helped create the algorithms for the popular Zestimate® home value and the Zillow Home Value Index (ZHVI).