In setting out to create a new home price index, a major problem Zillow sought to overcome was the inability of existing indices to account for the changing composition of properties sold in one time period versus another. Both a median sale price index and a repeat sales index are vulnerable to such biases (see the analysis here for an example of how influential the bias can be). For example, if expensive homes sell at a disproportionately higher rate than less expensive homes in one time period, a median sale price index will characterize that market as experiencing price appreciation relative to the prior period even if the true value of homes is unchanged between the two periods.
The ideal home price index would be based on sale prices for the same set of homes in each time period, so that there was never an issue of the sales mix differing across periods. This approach of using a constant basket of goods is widely used; common examples are commodity price indices and the consumer price index. Unfortunately, unlike commodities and consumer goods, for which we can observe prices in all time periods, we cannot observe prices for the same set of homes in all time periods because not all homes are sold in every time period.
The innovation that Zillow developed in 2005 was a way of approximating this ideal home price index by leveraging the valuations Zillow creates for all homes (called Zestimates). Instead of actual sale prices on every home, the index is created from estimated sale prices on every home. While there is some estimation error associated with each estimated sale price (which we report here), this error is just as likely to be above the actual sale price of a home as below it (in statistical terms, the estimates have minimal systematic error). Because of this, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for the same set of homes. But, importantly, Zillow has estimated sale prices not just for the homes that sold, but for all homes, even those that did not sell in that time period. From this data, a comprehensive and robust benchmark of home value trends can be computed that is immune to the changing mix of properties sold in different periods of time (see Dorsey et al. (2010) for another recent discussion of this approach).
For an in-depth comparison of the Zillow Home Value Index to the Case-Shiller Home Price Index, please refer to the Zillow Home Value Index Comparison to Case-Shiller.
Each Zillow Home Value Index (ZHVI) is a time series tracking the monthly median home value in a particular geographical region. In general, each ZHVI time series begins in April 1996. We generate the ZHVI at eight geographic levels: neighborhood, ZIP code, city, congressional district, county, metropolitan area, state and nation.
Estimated sale prices (Zestimates) are computed based on proprietary statistical and machine learning models. These models begin the estimation process by subdividing all of the homes in the United States into micro-regions, or subsets of homes either near one another or similar in physical attributes to one another. Within each micro-region, the models observe recent sale transactions and learn the relative contribution of various home attributes in predicting the sale price. These home attributes include physical facts about the home and land, prior sale transactions, tax assessment information and geographic location. Based on the patterns learned, these models can then estimate sale prices on homes that have not yet sold.
The sale transactions from which the models learn patterns include all full-value, arms-length sales that are not foreclosure resales. The purpose of the Zestimate is to give consumers an indication of the fair value of a home under the assumption that it is sold as a conventional, non-foreclosure sale. Similarly, the purpose of the Zillow Home Value Index is to give consumers insight into the home value trends for homes that are not being sold out of foreclosure status. Zillow research indicates that homes sold as foreclosures have typical discounts relative to non-foreclosure sales of between 20 and 40 percent, depending on the foreclosure saturation of the market. This is not to say that the Zestimate is not influenced by foreclosure resales. Zestimates are, in fact, influenced by foreclosure sales, but the pathway of this influence is through the downward pressure foreclosure sales put on non-foreclosure sale prices. It is the price signal observed in the latter that we are attempting to measure and, in turn, predict with the Zestimate.
Within each region, we calculate the ZHVI for various subsets of homes (or market segments) so as to afford greater insight into what is happening in a particular market. All market segments are shown in the table below. Only residential properties are included in the ZHVI calculation. Non-residential properties, such as office buildings, shopping centers and farms are not included.
One very useful form of market segmentation that we produce is based on the distribution of home values within the metropolitan area. Here we assign properties into one of three tiers based on their Zestimates on a particular date: top, middle or bottom tier. The thresholds for the price tiers vary from metro to metro and are determined by the distribution of home values in each metro. Since Zestimates are time-dependent, a property may belong to different price tiers at different dates. To reduce tier switching, we exclude properties near the boundaries of price tiers when assigning tiers. Thus, the number of Zestimates across the three tiers does not sum to the number of Zestimates in the “All Homes” market segment.
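As an illustration, the tier-assignment step can be sketched as below. The tercile cutoffs and the size of the boundary-exclusion band are illustrative assumptions; the actual thresholds used in production are not published.

```python
# Sketch of tier assignment by home-value terciles within a metro.
# The tercile cutoffs and the boundary-exclusion band (exclusion_band)
# are illustrative assumptions, not Zillow's actual parameters.

def assign_tiers(zestimates, exclusion_band=0.02):
    """Assign each home to 'bottom', 'middle' or 'top' tier, skipping
    homes whose rank falls within +/- exclusion_band (as a fraction of
    the list length) of a tier boundary, to reduce tier switching."""
    ranked = sorted(range(len(zestimates)), key=lambda i: zestimates[i])
    n = len(ranked)
    lo_cut, hi_cut = n / 3.0, 2 * n / 3.0
    band = exclusion_band * n
    tiers = {}
    for rank, idx in enumerate(ranked):
        if abs(rank - lo_cut) < band or abs(rank - hi_cut) < band:
            continue  # near a tier boundary: excluded from all tiers
        if rank < lo_cut:
            tiers[idx] = "bottom"
        elif rank < hi_cut:
            tiers[idx] = "middle"
        else:
            tiers[idx] = "top"
    return tiers

values = [120_000, 150_000, 180_000, 210_000, 240_000, 270_000,
          300_000, 330_000, 360_000]
tiers = assign_tiers(values)
```

Because homes near the two cutoffs are skipped, the three tiers together contain fewer homes than the full list, mirroring why the tier counts do not sum to the “All Homes” count.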
Table 1: ZHVI market segments (number of Zestimates is national, as of November 2013)

| Market Segment | Number of Zestimates | Description |
| --- | --- | --- |
| All Homes | 87.3 M | Single family + condominium + cooperative |
| Single Family | 78.1 M | Single family only |
| Condo | 9.2 M | Condominium + cooperative only |
| 0 or missing | 31.6 M | 0 bedrooms or bedroom count missing |
| 1 Bedroom | 1.7 M | 1 bedroom |
| 2 Bedroom | 11.1 M | 2 bedrooms |
| 3 Bedroom | 28.6 M | 3 bedrooms |
| 4 Bedroom | 11.7 M | 4 bedrooms |
| 5+ Bedroom | 2.7 M | 5 bedrooms or more |
| Top Tier | 27.0 M | Top price tier among homes within the same metropolitan area |
| Middle Tier | 27.0 M | Middle price tier among homes within the same metropolitan area |
| Bottom Tier | 27.0 M | Bottom price tier among homes within the same metropolitan area |
Using the estimated market value of every home as represented in the Zestimate, the main steps in the construction of the ZHVI are as follows:
Let t be a discrete independent time variable with a value at the end of each month. Let H(t) be an M by N matrix with each element h_{ij}(t) representing the number of homes at time t for the i-th market segment in the j-th geographical region, where M is the total number of market segments and N is the total number of unique regions having a minimum required number of Zestimates. Currently, we have M=12 and N=77,590. Geographical regions include national, state, metro, county, city, ZIP code and neighborhood. The Number of Zestimates column in Table 1 above represents the number of homes in the i-th element of h_{ij} when j=’National’ and t=’Nov-2013’.
Let z_{ij}(t) be the vector of Zestimates of all homes at time t having length h_{ij}(t) for i-th market segment and j-th region. The raw median Zestimate, r_{ij}(t), for i-th market segment and j-th region is defined as:
r_{ij}(t)=Median(z_{ij}(t))
r_{ij}(t) is the median Zestimate and is an element of the M by N matrix R(t). In order to ensure reliability and stability, we only compute r_{ij} when h_{ij}(t) is above some minimum threshold. For November 2013, there are a total of 389,451 market segments by regions for which the median could be computed:
Count{r_{ij}(t) ≠NA, for i=1,..M and j=1,..N} is 389,451.
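The raw-median step above can be sketched as follows. The minimum-count threshold used here is an assumed placeholder; the actual minimum required number of Zestimates is not published.

```python
# Minimal sketch of the raw median Zestimate r_ij(t): the median of all
# Zestimates in a segment/region, computed only when the segment has at
# least `min_count` homes. MIN_COUNT is an illustrative assumption.
from statistics import median

MIN_COUNT = 5  # placeholder; the production threshold is not published

def raw_median_zestimate(zestimates, min_count=MIN_COUNT):
    """Return Median(z_ij(t)), or None (i.e., r_ij(t) = NA) when the
    segment has too few Zestimates to be reliable."""
    if len(zestimates) < min_count:
        return None
    return median(zestimates)

r = raw_median_zestimate([250_000, 260_000, 275_000, 300_000, 410_000])
# r is the middle value, 275000
```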
Table 2: Market segment by region combinations with a computable median Zestimate (November 2013)

| Market Segment | National | State | MSA | County | City | Neighborhood | ZIP | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All Homes | 1 | 51 | 917 | 2,830 | 23,057 | 8,664 | 24,460 | 59,980 |
| Single Family | 1 | 51 | 917 | 2,828 | 22,976 | 8,068 | 24,249 | 59,090 |
| Condo | 1 | 51 | 507 | 895 | 4,189 | 2,916 | 6,629 | 15,188 |
| 0 or missing | 1 | 51 | 868 | 2,464 | 14,023 | 3,304 | 15,097 | 35,808 |
| 1 Bedroom | 1 | 51 | 537 | 1,097 | 2,112 | 1,080 | 3,418 | 8,296 |
| 2 Bedroom | 1 | 51 | 742 | 1,821 | 9,173 | 3,083 | 11,870 | 27,461 |
| 3 Bedroom | 1 | 51 | 817 | 2,105 | 13,310 | 5,523 | 15,796 | 37,603 |
| 4 Bedroom | 1 | 51 | 766 | 1,829 | 8,633 | 3,124 | 11,485 | 25,889 |
| 5+ Bedroom | 1 | 51 | 619 | 1,249 | 3,524 | 1,018 | 5,648 | 12,110 |
| Top Tier | 1 | 51 | 913 | 1,681 | 12,554 | 4,112 | 14,862 | 34,184 |
| Middle Tier | 1 | 51 | 913 | 1,704 | 14,058 | 4,877 | 16,364 | 37,968 |
| Bottom Tier | 1 | 51 | 913 | 1,676 | 12,941 | 5,119 | 15,173 | 35,874 |
Zestimate errors are both time and region dependent. While the errors produced by the Zestimate algorithm are generally equally distributed above and below the actual sale price, there can be some residual systematic error detected once more historical sales are known (systematic error here is defined as the median raw error being slightly greater or less than zero). In this event, raw median Zestimates are adjusted through the use of a correction factor in the manner described below.
Let u_{ij}(t) be the median home value free of systematic error. Then, the raw median Zestimate can be expressed in terms of u_{ij}(t) as:
r_{ij}(t)= {1+ b_{j}(t)} * u_{ij}(t)
where b_{j}(t) is the systematic error in Zestimates representing the median fluctuation of Zestimates above or below the actual sold prices within the time window centered around t for the j-th region. We calculate the Zestimate systematic error as:
b_{j}(t)= Median({z_{j}(t-1)- s_{j}(t)}/s_{j}(t))
where s_{j}(t) is a vector of sale prices and z_{j}(t-1) are Zestimates corresponding to the same properties as s_{j}(t) but with the estimated sale price taken from the period immediately prior to the actual sale (to ensure that the estimate has not been influenced by the sale). The vector of sales, s_{j}(t), is obtained through the following approach:
After computing b_{j}(t), the adjusted median of Zestimates is an M by N matrix U(t) calculated as:
u_{ij}(t)= r_{ij}(t)/{1+ b_{j}(t)}
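The bias estimate and correction defined by the equations above can be sketched directly. Variable names follow the text; the example data are illustrative.

```python
# Sketch of the systematic-error correction: b_j(t) is the median relative
# error of prior-period Zestimates against actual sale prices, and the raw
# median Zestimate is then deflated by (1 + b_j(t)).
from statistics import median

def systematic_error(prior_zestimates, sale_prices):
    """b_j(t) = Median((z_j(t-1) - s_j(t)) / s_j(t)).

    prior_zestimates: Zestimates from the period immediately before each
    sale (so the estimate is not influenced by the sale itself).
    sale_prices: the realized sale prices for the same properties."""
    return median((z - s) / s for z, s in zip(prior_zestimates, sale_prices))

def adjusted_median(raw_median, bias):
    """u_ij(t) = r_ij(t) / (1 + b_j(t))."""
    return raw_median / (1.0 + bias)

z_prev = [202_000, 310_000, 155_000]  # Zestimates just before each sale
sales  = [200_000, 300_000, 150_000]  # realized sale prices
b = systematic_error(z_prev, sales)   # median of (+1.0%, +3.33%, +3.33%)
u = adjusted_median(206_000, b)       # raw median deflated by ~3.33%
```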
We apply a 5-term Henderson moving average filter to U(t) to reduce noise in the data. This filter was derived by Henderson (1916). The filter weights applied in the middle of a time series are symmetric, while the end filter weights are asymmetric.
MA(t) = w_{1}U(t-2)+ w_{2}U(t-1)+w_{3}U(t)+ w_{4}U(t+1)+ w_{5}U(t+2)
where
w = (0, 0, 0.76467, 0.36713, -0.13181) for the start point: t = 1
w = (0, 0.22776, 0.52522, 0.29121, -0.04419) for t = 2
w = (-0.07343, 0.29371, 0.55944, 0.29371, -0.07343) for the middle points: t = 3, 4, .. , TMax-2
w = (-0.04419, 0.29121, 0.52522, 0.22776, 0) for t = TMax-1
w = (-0.13181, 0.36713, 0.76467, 0, 0) for the end point: t = TMax
The resultant M by N matrix MA(t) is a smooth estimate of the median home value free of residual systematic error. This may not be as necessary for large regions such as the nation and states because of the large available data set, but it is applied to all levels for consistency.
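The filter above can be sketched for a single region's series as follows, using the symmetric weights for interior points and the asymmetric weights at the two ends.

```python
# Sketch of the 5-term Henderson moving average with the weights listed
# above: symmetric weights for interior points, asymmetric weights at the
# series ends. Operates on one region's median-value series u (length >= 5).

def henderson5(u):
    """Return the Henderson-smoothed series MA(t) for the input series u."""
    n = len(u)
    w_mid    = (-0.07343, 0.29371, 0.55944, 0.29371, -0.07343)  # interior
    w_first  = (0.0, 0.0, 0.76467, 0.36713, -0.13181)           # t = 1
    w_second = (0.0, 0.22776, 0.52522, 0.29121, -0.04419)       # t = 2
    w_penult = (-0.04419, 0.29121, 0.52522, 0.22776, 0.0)       # t = TMax-1
    w_last   = (-0.13181, 0.36713, 0.76467, 0.0, 0.0)           # t = TMax
    out = []
    for t in range(n):
        if t == 0:
            w = w_first
        elif t == 1:
            w = w_second
        elif t == n - 2:
            w = w_penult
        elif t == n - 1:
            w = w_last
        else:
            w = w_mid
        # weight k applies to u[t-2+k]; zero weights (out-of-range terms)
        # are skipped so no index ever falls outside the series
        out.append(sum(wk * u[t - 2 + k] for k, wk in enumerate(w) if wk != 0.0))
    return out
```

Because each weight set sums to (approximately) one and the symmetric weights reproduce polynomial trends, a flat or linear series passes through the interior of the filter essentially unchanged while month-to-month noise is damped.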
Home sales are affected by seasons within the same year. Adjusting for seasonality is desirable so that the trend is more apparent for ease of comparison and forecasting. Since Zestimates and the ZHVI depend on sale prices, the time series MA(t) does contain some seasonality. We remove this seasonality using a seasonal-trend decomposition procedure (STL) based on the Loess method developed by Cleveland et al. (1990). STL is a filtering procedure for decomposing a time series into seasonal, trend and remainder components:
MA(t)= S(t)+T(t)+ RE(t)
where S(t), T(t) and RE(t) are the seasonal, trend and remainder components respectively. We remove seasonality by adding the trend and remainder components to form the seasonally adjusted ZHVI:
ZHVI(t)= T(t)+ RE(t)
The remainder component, RE(t), represents irregular features in the time series, which we preserve.
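The decomposition and re-addition of trend and remainder can be illustrated with a simplified stand-in for STL (the real procedure uses iterated Loess smoothing; see Cleveland et al. (1990)). Here the trend is approximated by a crude 12-month moving average and the seasonal component by each calendar month's mean deviation from that trend; everything about this sketch other than the MA(t) = S(t) + T(t) + RE(t) structure is a simplifying assumption.

```python
# Simplified stand-in for the STL-based seasonal adjustment: estimate a
# trend with a 12-month moving average, take each calendar month's mean
# deviation from the trend as the seasonal component S(t), and return
# the series minus S(t) -- i.e., T(t) + RE(t), the seasonally adjusted ZHVI.

def seasonally_adjust(values, period=12):
    n = len(values)
    half = period // 2
    # crude trend: average of the 12 months around t (edges left as None)
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(values[t - half:t + half]) / period
    # mean deviation from trend for each calendar month
    dev_by_month = {m: [] for m in range(period)}
    for t in range(n):
        if trend[t] is not None:
            dev_by_month[t % period].append(values[t] - trend[t])
    seasonal = {m: (sum(d) / len(d) if d else 0.0)
                for m, d in dev_by_month.items()}
    # center the seasonal component so it sums to ~zero over one year
    mean_s = sum(seasonal.values()) / period
    return [values[t] - (seasonal[t % period] - mean_s) for t in range(n)]

# a flat series with a recurring January bump: adjustment flattens it out
series = [100 + (5 if t % 12 == 0 else 0) for t in range(48)]
adjusted = seasonally_adjust(series)
```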
The time series matrix ZHVI(t) has the same dimensions as H(t), which is M by N (as noted, 12 x 77,590). While this theoretically could produce almost 1 million different time series, in practice many time series are eliminated because of data sparseness or temporal volatility. A four-star quality rating function is applied to every ZHVI time series. The variables fed to this function are features associated with each ZHVI time series. They include:
After suppressing those with star ratings below 2, there are 242,362 unique deliverable ZHVI time series for the report period ending November 2013.
Table 3: Deliverable ZHVI time series by market segment and region type (November 2013)

| Market Segment | National | State | MSA | County | City | Neighborhood | ZIP | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| All Homes | 1 | 49 | 485 | 947 | 10,647 | 5,176 | 12,602 | 29,907 |
| Single Family | 1 | 49 | 486 | 946 | 10,623 | 4,955 | 12,507 | 29,567 |
| Condo | 1 | 48 | 347 | 600 | 3,467 | 2,087 | 5,434 | 11,984 |
| 0 or missing | 1 | 49 | 425 | 813 | 6,072 | 2,116 | 7,315 | 16,791 |
| 1 Bedroom | 1 | 47 | 311 | 539 | 1,469 | 762 | 2,388 | 5,517 |
| 2 Bedroom | 1 | 48 | 447 | 840 | 6,376 | 2,911 | 8,649 | 19,272 |
| 3 Bedroom | 1 | 48 | 511 | 998 | 8,746 | 4,140 | 11,131 | 25,575 |
| 4 Bedroom | 1 | 50 | 509 | 956 | 6,863 | 2,708 | 9,346 | 20,433 |
| 5+ Bedroom | 1 | 46 | 426 | 769 | 3,073 | 959 | 4,986 | 10,260 |
| Top Tier | 1 | 47 | 533 | 915 | 8,744 | 3,030 | 10,899 | 24,169 |
| Middle Tier | 1 | 50 | 493 | 860 | 9,943 | 3,814 | 11,676 | 26,387 |
| Bottom Tier | 1 | 48 | 486 | 831 | 7,927 | 3,302 | 9,905 | 22,500 |
ZHVIs for all geographic regions and market segments are updated every month. Since there is variable latency in Zillow’s receipt of transactional data from public records, Zillow’s estimate of residual systematic error can change as new transactions arrive. Historical estimates of systematic error are recalculated monthly and incorporated into revised ZHVI time series. As a result, there can be restatements of the ZHVI for up to three years from the initial reporting date.
With the release of November 2013 Zillow Home Value Index (ZHVI) data, we have improved the underlying valuation model, introduced additional data filtering algorithms and developed a new approach to dealing with residual systematic error. These changes led to a 24.1% increase in the number of regions for which Zillow reports a ZHVI (see Table 4). The historical values for the ZHVI have been restated with these changes, leading to a slightly higher current estimate of the national median home value. The revised ZHVIs are qualitatively similar to the ZHVI computed using the previous methodology, although the new time series are significantly smoother.
Table 4: Regions with a published ZHVI, October vs. November 2013

| Region Type | Oct. ZHVI | Nov. ZHVI | Increase |
| --- | --- | --- | --- |
| States | 44 | 49 | 11.4% |
| Metros | 389 | 485 | 24.7% |
| Counties | 744 | 947 | 27.3% |
| Cities | 8,535 | 10,647 | 24.7% |
| Neighborhoods | 4,190 | 5,176 | 23.5% |
| ZIP Codes | 10,205 | 12,602 | 23.5% |
| Total | 24,107 | 29,906 | 24.1% |
The new valuation model and data filtering algorithms have led to a restatement of past values for the ZHVI. Figures 1 and 2 compare the ZHVI for the old versus new methodology for the US and for the Composite 20 metropolitan markets. The revision of the ZHVI has generally raised the overall estimated median value of homes. The national median home value is higher by 3.2%: $168,000 versus $162,800 with the previous methodology. The increase is due to better accuracy of the new valuation model and better screening of transactions that are normally excluded from the ZHVI (e.g., foreclosures and foreclosure re-sales).
Figure 1: Comparison of new and old ZHVI methodologies (U.S.)
Figure 2: Comparison of new and old ZHVI methodologies (Composite 20)
Qualitatively, the restated ZHVIs are similar to the ZHVIs calculated using the previous methodology, although they have less volatility, particularly on shorter time scales. The size and direction of the revision depends very much on the region, although the ZHVI year-over-year change is typically revised downward for areas that have experienced high price appreciation. For example, the November year-over-year change for Phoenix has been revised downward from 19.4% to 15.0%. In addition to revisions due to the more accurate valuation method and improved transaction filtering, the new approach to correcting residual systematic error also contributes somewhat to the restatement of the index levels.
The methodology improvements released with the November 2013 data fall into three main areas. First, the ZHVI has been rebased on the latest valuation model that produces the Zestimates. Second, new and improved filtering algorithms were incorporated to screen out bad transactional data. Third, the approach to estimating the systematic error correction was updated. These are discussed below.
More Accurate Valuation Model: The November 2013 ZHVI release incorporated the latest version of the home valuation model, which is the model that produces the Zestimate. This latest version resulted in a significant increase in the accuracy of the Zestimate (which is 13% more accurate than a year ago). Accuracy was especially improved for high-end homes (30% improvement), waterfront homes and homes in less urban areas. The new valuation model resulted in moderate revisions to the national ZHVI, producing a small increase in the overall level of the ZHVI and somewhat damped peak-to-trough cycles.
Improved Transaction Filtering: The November 2013 ZHVI also took advantage of improved filtering on transactional data. This change impacts the ZHVI indirectly through a corresponding improvement in the valuation model (see above) and more directly through more accurate correction for residual systematic error. The valuation algorithm and the ZHVI exclude transactions that are not representative of what is considered a full-value, arms-length transaction between a buyer and a seller. This definition excludes transactions such as foreclosures, foreclosure re-sales, estate sales and intra-family transfers. In doing a better job of identifying these transactions, the ZHVI has increased on a national basis as well as in many regions. For example, the current ZHVI for Sacramento is $300,000 under the revised methodology versus $284,500 under the previous methodology, a 5.4% increase in level.
Systematic Error Correction: The systematic error correction is based on comparing transactions against the Zestimate for a time period. Since transactions are relatively sparse, particularly in smaller geographic regions, the new systematic error correction method smooths the bias over time and shrinks the estimate towards zero. The smoothing procedure is based on fitting a natural cubic spline with knots evenly spaced every twelve months; the smoothed bias is the predicted value from the fitted spline model.
For regions with fewer than 100 transactions in a time period, the resulting smoothed estimate of bias will be shrunk towards zero.
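The shrinkage behavior described above can be sketched as follows. The weighting rule (transaction count over count plus a pseudo-count of 100) is an assumed form chosen to match the stated behavior, not the published production formula.

```python
# Illustrative sketch of shrinking a region's bias estimate toward zero
# when transactions are sparse. The rule n / (n + pseudo_count) is an
# assumption: it leaves well-sampled regions nearly untouched and pulls
# regions with fewer than ~100 transactions strongly toward zero bias.

def shrink_bias(raw_bias, n_transactions, pseudo_count=100):
    """Return the raw bias estimate shrunk toward zero in proportion to
    how few transactions support it."""
    weight = n_transactions / (n_transactions + pseudo_count)
    return weight * raw_bias

ample = shrink_bias(0.04, 1000)  # ample data: close to the raw estimate
sparse = shrink_bias(0.04, 20)   # sparse data: pulled toward zero
```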
New coverage of the ZHVI (green) in addition to the old coverage (blue) is shown in the interactive map below.
Henderson, R. (1916). Note on Graduation by Adjusted Average. Transactions of the American Society of Actuaries, 17, 43-48.
Cleveland, R.B., Cleveland, W.S., McRae, J.E., and Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6, 3–73.
Dorsey, R.E., Hu, H., Mayer, W.J., & Wang, H. (2010). Hedonic versus repeat-sales housing price indexes for measuring the recent boom-bust cycle. Journal of Housing Economics, 19 (2), 75-93.
Updated on January 3rd, 2014 by Andrew Bruce