Zillow Home Value Index: Methodology

Posted by: Andrew Bruce    Posted date: January 3, 2014

Introduction

In setting out to create a new home price index, a major problem Zillow sought to overcome in existing indices was their inability to deal with the changing composition of properties sold in one time period versus another time period. Both a median sale price index and a repeat sales index are vulnerable to such biases (see the analysis here for an example of how influential the bias can be). For example, if expensive homes sell at a disproportionately higher rate than less expensive homes in one time period, a median sale price index will characterize this market as experiencing price appreciation relative to the prior period of time even if the true value of homes is unchanged between the two periods.

The ideal home price index would be based on sale prices for the same set of homes in each time period, so that the sales mix never differs across periods. This constant-basket approach is widely used; common examples are commodity price indices and consumer price indices. Unfortunately, unlike commodities and consumer goods, for which we can observe prices in all time periods, we can’t observe prices on the same set of homes in all time periods because not all homes are sold in every time period.

The innovation that Zillow developed in 2005 was a way of approximating this ideal home price index by leveraging the valuations Zillow creates on all homes (called Zestimates). Instead of actual sale prices on every home, the index is created from estimated sale prices on every home. While there is some estimation error associated with each estimated sale price (which we report here), this error is just as likely to be above the actual sale price of a home as below (in statistical terms, this is referred to as minimal systematic error). Because of this fact, the distribution of actual sale prices for homes sold in a given time period looks very similar to the distribution of estimated sale prices for this same set of homes. But, importantly, Zillow has estimated sale prices not just for the homes that sold, but for all homes even if they didn’t sell in that time period. From this data, a comprehensive and robust benchmark of home value trends can be computed which is immune to the changing mix of properties that sell in different periods of time (see Dorsey et al. (2010) for another recent discussion of this approach).
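As a rough illustration of the composition problem, the sketch below uses made-up numbers (not Zillow data). True home values are identical in both periods, yet the median sale price jumps because a different mix of homes happens to sell, while the median over every home stays fixed.

```python
from statistics import median

# Hypothetical market of 100 homes whose values do not change between periods.
home_values = [100_000] * 50 + [200_000] * 30 + [400_000] * 20

# Period 1: mostly lower-priced homes happen to sell.
sold_period_1 = [100_000] * 6 + [200_000] * 3 + [400_000] * 1
# Period 2: mostly expensive homes happen to sell.
sold_period_2 = [100_000] * 1 + [200_000] * 3 + [400_000] * 6

# A median sale price index sees a large "appreciation" that never happened.
median_sale_1 = median(sold_period_1)
median_sale_2 = median(sold_period_2)

# A median over all homes is unaffected by which homes sold.
median_all_homes = median(home_values)

print(median_sale_1, median_sale_2, median_all_homes)
```

In practice the values for unsold homes are estimates rather than known quantities, which is why the absence of systematic error in those estimates matters.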

For an in-depth comparison of the Zillow Home Value Index to the Case-Shiller Home Price Index, please refer to the Zillow Home Value Index Comparison to Case-Shiller.

Each Zillow Home Value Index (ZHVI) is a time series tracking the monthly median home value in a particular geographical region. In general, each ZHVI time series begins in April 1996. We generate the ZHVI at the following geographic levels: neighborhood, ZIP code, city, congressional district, county, metropolitan area, state and the nation.

Underlying Data

Estimated sale prices (Zestimates) are computed based on proprietary statistical and machine learning models. These models begin the estimation process by subdividing all of the homes in the United States into micro-regions, or subsets of homes either near one another or similar in physical attributes to one another. Within each micro-region, the models observe recent sale transactions and learn the relative contribution of various home attributes in predicting the sale price. These home attributes include physical facts about the home and land, prior sale transactions, tax assessment information and geographic location. Based on the patterns learned, these models can then estimate sale prices on homes that have not yet sold.

The sale transactions from which the models learn patterns include all full-value, arm's-length sales that are not foreclosure resales. The purpose of the Zestimate is to give consumers an indication of the fair value of a home under the assumption that it is sold as a conventional, non-foreclosure sale. Similarly, the purpose of the Zillow Home Value Index is to give consumers insight into the home value trends for homes that are not being sold out of foreclosure status. Zillow research indicates that homes sold as foreclosures have typical discounts relative to non-foreclosure sales of between 20 and 40 percent, depending on the foreclosure saturation of the market. This is not to say that the Zestimate is not influenced by foreclosure resales. Zestimates are, in fact, influenced by foreclosure sales, but the pathway of this influence is through the downward pressure foreclosure sales put on non-foreclosure sale prices. It is the price signal observed in the latter that we are attempting to measure and, in turn, predict with the Zestimate.

Market Segments

Within each region, we calculate the ZHVI for various subsets of homes (or market segments) so as to afford greater insight into what is happening in a particular market. All market segments are shown in the table below. Only residential properties are included in the ZHVI calculation. Non-residential properties, such as office buildings, shopping centers and farms are not included.

One very useful form of market segmentation that we produce is based on the distribution of home values within the metropolitan area. Here we assign properties into one of three tiers based on their Zestimates on a particular date: top, middle or bottom tier. The thresholds for the price tiers vary from metro to metro and are determined by the distribution of home values in each metro. Since Zestimates are time-dependent, a property may belong to different price tiers at different dates. To reduce tier switching, we exclude properties near the boundaries of price tiers when assigning tiers. Thus, the total number of Zestimates across the three tiers does not equal the number of Zestimates for the “All Homes” market segment.
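The tier assignment with boundary exclusion might look something like the sketch below. The cut points, the 5% exclusion band and the function name assign_tier are all assumptions made for illustration; the actual thresholds and exclusion rule are not stated here.

```python
def assign_tier(value, low_cut, high_cut, band=0.05):
    """Return 'bottom', 'middle' or 'top', or None when the value is near a boundary.

    low_cut, high_cut and band are hypothetical parameters, not Zillow's rule.
    """
    # Exclude homes within +/- band (as a fraction) of either cut point.
    for cut in (low_cut, high_cut):
        if abs(value - cut) <= band * cut:
            return None
    if value < low_cut:
        return "bottom"
    if value < high_cut:
        return "middle"
    return "top"


# Hypothetical Zestimates for one metro on one date.
zestimates = [90_000, 148_000, 200_000, 310_000, 450_000]
tiers = [assign_tier(z, low_cut=150_000, high_cut=300_000) for z in zestimates]
assigned = [t for t in tiers if t is not None]

print(tiers)
print(len(assigned), "of", len(zestimates), "homes receive a tier")
```

Because homes near a boundary receive no tier, the three tiers together cover fewer homes than the full set, which matches the tier counts in Table 1.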

Table 1: Market Segments for Zillow Home Value Index
Market Segment   Number of Zestimates  Description
All Homes        87.3 M                Single family + condominium + cooperative
Single Family    78.1 M                Single family only
Condo            9.2 M                 Condominium + cooperative only
0 or missing     31.6 M                0 Bedroom or missing
1 Bedroom        1.7 M                 1 Bedroom
2 Bedroom        11.1 M                2 Bedroom
3 Bedroom        28.6 M                3 Bedroom
4 Bedroom        11.7 M                4 Bedroom
5+Bedroom        2.7 M                 5 Bedroom or more
Top Tier         27.0 M                Top price tier among homes within the same metropolitan area
Middle Tier      27.0 M                Middle price tier among homes within the same metropolitan area
Bottom Tier      27.0 M                Bottom price tier among homes within the same metropolitan area

 

Methodology

Using the estimated market value of every home as represented in the Zestimate, the main steps in the construction of the ZHVI are as follows:

  1. Calculate Raw Median Zestimates
  2. Adjust for Any Residual Systematic Error
  3. Apply Henderson Moving Average Filter
  4. Apply Seasonal Adjustment
  5. Final Quality Control

 

Calculating Raw Median Zestimates

Let t be a discrete independent time variable with a value at the end of each month. Let H(t) be an M by N matrix with each element h_{ij}(t) representing the number of homes at time t for the i-th market segment in the j-th geographical region, where M is the total number of market segments and N is the total number of unique regions having a minimum required number of Zestimates. Currently, we have M = 12 and N = 77,590. Geographical regions include national, state, metro, county, city, ZIP code and neighborhood. The Number of Zestimates column in Table 1 above gives h_{ij}(t) for each market segment i when j = 'National' and t = 'Nov-2013'.

Let z_{ij}(t) be the vector of Zestimates of all homes at time t, having length h_{ij}(t), for the i-th market segment and j-th region. The raw median Zestimate, r_{ij}(t), for the i-th market segment and j-th region is defined as:

r_{ij}(t) = Median(z_{ij}(t))

r_{ij}(t) is the median Zestimate and is an element of the M by N matrix R(t). In order to ensure reliability and stability, we only compute r_{ij}(t) when h_{ij}(t) is above some minimum threshold. For November 2013, there are a total of 389,451 combinations of market segment and region for which the median could be computed:

Count{r_{ij}(t) ≠ NA, for i = 1, ..., M and j = 1, ..., N} = 389,451.
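A minimal sketch of this step follows. The threshold value MIN_HOMES, the region name and the Zestimates are hypothetical; the actual minimum threshold is not given here. Cells with too few homes are left as None, playing the role of NA.

```python
from statistics import median

MIN_HOMES = 3  # hypothetical minimum threshold; the real value is not stated


def raw_median(zestimates, min_homes=MIN_HOMES):
    """Return the median Zestimate, or None (NA) when there are too few homes."""
    if len(zestimates) < min_homes:
        return None
    return median(zestimates)


# Hypothetical Zestimates keyed by (market segment, region).
zestimates_by_cell = {
    ("All Homes", "Region A"): [210_000, 185_000, 240_000, 199_000],
    ("Condo", "Region A"): [150_000, 162_000],
}

raw_medians = {cell: raw_median(z) for cell, z in zestimates_by_cell.items()}

# Number of (segment, region) cells with a usable median, as in the count above.
computed = sum(1 for r in raw_medians.values() if r is not None)
print(raw_medians, computed)
```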

Table 2: Number of regions by market segment having raw median Zestimates
Market Segment   National  State    MSA  County    City  Neighborhood     Zip   Total
All Homes               1     51    917   2,830  23,057         8,664  24,460  59,980
Single Family           1     51    917   2,828  22,976         8,068  24,249  59,090
Condo                   1     51    507     895   4,189         2,916   6,629  15,188
0 or missing            1     51    868   2,464  14,023         3,304  15,097  35,808
1 Bedroom               1     51    537   1,097   2,112         1,080   3,418   8,296
2 Bedroom               1     51    742   1,821   9,173         3,083  11,870  27,461
3 Bedroom               1     51    817   2,105  13,310         5,523  15,796  37,603
4 Bedroom               1     51    766   1,829   8,633         3,124  11,485  25,889
5+Bedroom               1     51    619   1,249   3,524         1,018   5,648  12,110
Top Tier                1     51    913   1,681  12,554         4,112  14,862  34,184
Middle Tier             1     51    913   1,704  14,058         4,877  16,364  37,968
Bottom Tier             1     51    913   1,676  12,941         5,119  15,173  35,874

 

Adjust for Any Residual Systematic Error

Zestimate errors are both time and region dependent. While the errors produced by the Zestimate algorithm are generally equally distributed above and below the actual sale price, there can be some residual systematic error detected once more historical sales are known (systematic error here is defined as the median raw error being slightly greater or less than zero). In this event, raw median Zestimates are adjusted through the use of a correction factor in the manner described below.

Let u_{ij}(t) be the median home value free of systematic error. Then, the raw median Zestimate can be expressed in terms of u_{ij}(t) as:

r_{ij}(t) = {1 + b_j(t)} * u_{ij}(t)

where b_j(t) is the systematic error in Zestimates, representing the median fluctuation of Zestimates above or below the actual sale prices within the time window centered around t for the j-th region. We calculate the Zestimate systematic error as:

b_j(t) = Median({z_j(t-1) - s_j(t)} / s_j(t))

where s_j(t) is a vector of sale prices and z_j(t-1) are Zestimates corresponding to the same properties as s_j(t) but with the estimated sale price taken from the period immediately prior to the actual sale (to ensure that the estimate has not been influenced by the sale). The vector of sales, s_j(t), and the resulting estimate of b_j(t) are obtained through the following approach:

  1. Find all sales within a 30-day window centered on t.
  2. Increase the window on either side of time t until at least 100 transactions are obtained for region j.
  3. The maximum length of the window is 365 days.
  4. For time t at the two endpoints of the time series, a maximum window length of 182 days is imposed.
  5. Fit a natural cubic smoothing spline with knots evenly spaced every twelve months to the time series b_j(t) to remove noise.
  6. If fewer than 100 transactions are present, then shrink b_j(t) towards zero.
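Steps 1 to 4 above can be sketched as follows. This is an illustration under stated assumptions rather than the production procedure: sale dates are represented as integer day numbers, the window is widened symmetrically by one day per side at each step, and the function name select_sales is hypothetical. The spline smoothing in step 5 is not shown.

```python
def select_sales(sale_days, t, min_sales=100, start_window=30, max_window=365):
    """Return (window length in days, sales inside the window centered on t).

    Assumes a symmetric window widened by one day per side per step; the
    actual step size is not stated in the text.
    """
    half = start_window / 2
    max_half = max_window / 2
    while True:
        in_window = [d for d in sale_days if abs(d - t) <= half]
        # Stop once enough sales are found or the maximum window is reached.
        if len(in_window) >= min_sales or half >= max_half:
            return 2 * half, in_window
        half = min(half + 1, max_half)


# Hypothetical busy region: one sale every 2 days.
dense_sales = list(range(0, 1000, 2))
window_dense, sales_dense = select_sales(dense_sales, t=500)

# Hypothetical sparse region: one sale every 10 days, so the window hits its cap.
sparse_sales = list(range(0, 1000, 10))
window_sparse, sales_sparse = select_sales(sparse_sales, t=500)

print(window_dense, len(sales_dense))
print(window_sparse, len(sales_sparse))
```

For the two endpoints of the time series, the same function could be called with max_window=182 to mirror step 4.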

After computing b_j(t), the adjusted median of Zestimates is an M by N matrix U(t) calculated as:

u_{ij}(t) = r_{ij}(t) / {1 + b_j(t)}
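As a small worked example of the two formulas, the sketch below computes the median relative error from three hypothetical sales and then divides a hypothetical raw median by one plus that error. All numbers and function names are illustrative only.

```python
from statistics import median


def systematic_error(prior_zestimates, sale_prices):
    """Median relative error of pre-sale Zestimates against actual sale prices."""
    return median((z - s) / s for z, s in zip(prior_zestimates, sale_prices))


def adjusted_median(raw_median, bias):
    """Remove the systematic error from a raw median Zestimate."""
    return raw_median / (1 + bias)


# Hypothetical sales and the Zestimates from the period just before each sale.
sale_prices = [200_000, 250_000, 400_000]
prior_zestimates = [202_000, 255_000, 420_000]  # errors of +1%, +2%, +5%

bias = systematic_error(prior_zestimates, sale_prices)  # median error, about 0.02
adjusted = adjusted_median(306_000, bias)  # about 300,000

print(bias, adjusted)
```

A positive error means Zestimates ran above sale prices in the window, so the raw median is scaled down; a negative error scales it up.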

Apply Henderson Moving Average Filter

We apply a 5-term Henderson moving average filter to U(t) to reduce noise in the data. This filter was derived by Henderson (1916). The filter weights applied in the middle of a time series are symmetric, while the end filter weights are asymmetric.

MA(t) = w_1 U(t-2) + w_2 U(t-1) + w_3 U(t) + w_4 U(t+1) + w_5 U(t+2)

where w = (w_1, w_2, w_3, w_4, w_5) is given by:

w = (-0.07343, 0.29371, 0.55944, 0.29371, -0.07343) for the middle points: t = 3, 4, ..., T_{Max} - 2

w = (-0.04419, 0.29121, 0.52522, 0.22776, 0) for t = T_{Max} - 1

w = (0, 0.22776, 0.52522, 0.29121, -0.04419) for t = 2

w = (-0.13181, 0.36713, 0.76467, 0, 0) for the end point: t = T_{Max}

w = (0, 0, 0.76467, 0.36713, -0.13181) for the start point: t = 1

The resultant M by N matrix MA(t) is a smooth estimate of the median home value free of residual systematic error. This smoothing may not be as necessary for large regions such as the nation and states because of the large available data set, but it is applied to all levels for consistency.
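For a single time series, the filter can be sketched as below using the weights listed above. The function name henderson5 and the zero-based indexing are choices made for this illustration; weights of zero are skipped so that no value outside the series is ever read.

```python
SYMMETRIC = (-0.07343, 0.29371, 0.55944, 0.29371, -0.07343)
NEXT_TO_LAST = (-0.04419, 0.29121, 0.52522, 0.22776, 0.0)
LAST = (-0.13181, 0.36713, 0.76467, 0.0, 0.0)


def henderson5(series):
    """Apply the 5-term Henderson filter with asymmetric end weights."""
    n = len(series)
    if n < 5:
        raise ValueError("need at least 5 observations")
    smoothed = []
    for t in range(n):  # zero-based: t = 0 is the start point
        if t == 0:
            weights = LAST[::-1]          # start point mirrors the end point
        elif t == 1:
            weights = NEXT_TO_LAST[::-1]  # second point mirrors second-to-last
        elif t == n - 2:
            weights = NEXT_TO_LAST
        elif t == n - 1:
            weights = LAST
        else:
            weights = SYMMETRIC
        total = 0.0
        for k, w in enumerate(weights):
            if w != 0.0:  # zero weights fall outside the series at the ends
                total += w * series[t + k - 2]
        smoothed.append(total)
    return smoothed


flat = henderson5([100.0] * 8)
line = henderson5([float(v) for v in range(10)])
print(flat)
print(line)
```

Each weight vector sums to one (up to rounding of the published weights), so a constant series passes through essentially unchanged.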

Apply Seasonal Adjustment

Home sales vary seasonally within each year. Adjusting for seasonality is desirable so that the trend is more apparent for ease of comparison and forecasting. Since Zestimates and the ZHVI depend on sale prices, the time series MA(t) does contain some seasonality. We remove this seasonality using a seasonal-trend decomposition procedure (STL) based on the Loess method developed by Cleveland et al. (1990). STL is a filtering procedure for decomposing a time series into seasonal, trend and remainder components:

MA(t)= S(t)+T(t)+ RE(t)

where S(t), T(t) and RE(t) are the seasonal, trend and remainder components respectively. We remove seasonality by adding the trend and remainder components to form the seasonally adjusted ZHVI:

ZHVI(t)= T(t)+ RE(t)

The remainder component, RE(t), represents irregular features in the time series, which we preserve.
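The arithmetic of the adjustment can be illustrated with the sketch below. Note that it is not STL: as a stand-in it uses a classical additive decomposition (a centered 2 x 12 moving average for the trend and per-month means for the seasonal component), chosen only because it fits in a few lines of standard-library Python. An actual STL implementation, such as the STL class in statsmodels, would replace seasonal_component. The synthetic series and function name are assumptions.

```python
def seasonal_component(series, period=12):
    """Estimate an additive seasonal component (classical method, not STL).

    Requires at least 2 * period observations.
    """
    n = len(series)
    half = period // 2
    # Rough trend: centered 2 x period moving average (undefined at the ends).
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Average the detrended values for each position within the year.
    by_month = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            by_month[t % period].append(series[t] - trend[t])
    means = [sum(v) / len(v) for v in by_month]
    center = sum(means) / period  # force the seasonal effects to sum to zero
    return [means[t % period] - center for t in range(n)]


# Synthetic smoothed series: linear trend plus a repeating 12-month pattern.
pattern = [2, 1, 0, -1, -2, -3, -2, -1, 0, 1, 2, 3]
ma = [100 + 0.5 * t + pattern[t % 12] for t in range(36)]

seasonal = seasonal_component(ma)
# Seasonally adjusted series: MA(t) - S(t), which equals T(t) + RE(t).
adjusted = [x - s for x, s in zip(ma, seasonal)]
print(adjusted[:6])
```

Subtracting the seasonal component is the same as adding the trend and remainder, so irregular movements remain in the adjusted series.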

Final Quality Control

The time series matrix ZHVI(t) has the same dimension as H(t), which is M by N (as noted, 12 x 77,590). While this theoretically could produce almost 1 million different time series, in practice many time series are eliminated because of data sparseness or temporal volatility. A four-star quality rating function is applied to every ZHVI time series. The variables fed to this function are features associated with each ZHVI time series. They include:

  1. Number of Zestimates
  2. Number of transactions in most recent three months
  3. Temporal volatility measured by annualized, monthly and quarterly change
  4. Number of outliers
  5. Gaps
  6. Jumps
  7. Disclosure or non-disclosure states

After suppressing those with star ratings below 2, there are 242,362 unique deliverable ZHVI time series for the report period ending November 2013.

Table 3: Number of deliverable ZHVI time series by region level and market segment
Market Segment   National  State    MSA  County    City  Neighborhood     Zip   Total
All Homes               1     49    485     947  10,647         5,176  12,602  29,907
Single Family           1     49    486     946  10,623         4,955  12,507  29,567
Condo                   1     48    347     600   3,467         2,087   5,434  11,984
0 or missing            1     49    425     813   6,072         2,116   7,315  16,791
1 Bedroom               1     47    311     539   1,469           762   2,388   5,517
2 Bedroom               1     48    447     840   6,376         2,911   8,649  19,272
3 Bedroom               1     48    511     998   8,746         4,140  11,131  25,575
4 Bedroom               1     50    509     956   6,863         2,708   9,346  20,433
5+Bedroom               1     46    426     769   3,073           959   4,986  10,260
Top Tier                1     47    533     915   8,744         3,030  10,899  24,169
Middle Tier             1     50    493     860   9,943         3,814  11,676  26,387
Bottom Tier             1     48    486     831   7,927         3,302   9,905  22,500

 

Restatement of the ZHVI

ZHVIs for all geographic regions and market segments are updated every month. Since there is variable latency in Zillow’s receipt of transactional data from public records, Zillow’s estimate of residual systematic error can change as new transactions arrive. Historical estimates of systematic error are recalculated monthly and incorporated into revised ZHVI time series. As a result, there can be restatements of the ZHVI for up to three years from the initial reporting date.

Impact of Methodology Change: November 2013

With the release of November 2013 Zillow Home Value Index (ZHVI) data, we have improved the underlying valuation model, introduced additional data filtering algorithms and developed a new approach to dealing with residual systematic error. These changes led to a 24.1% increase in the number of regions for which Zillow reports a ZHVI (see Table 4). The historical values for the ZHVI have been restated with these changes, leading to a slightly higher current estimate of the median home value nationally. The revised ZHVIs are qualitatively similar to the ZHVI computed using the previous methodology, although the new time series are significantly smoother.

Table 4: Increase in reporting regions by region type from the previous (Oct. 2013) and new (Nov. 2013) methodologies
Region Type    Oct. ZHVI  Nov. ZHVI  Increase
States                44         49     11.4%
Metros               389        485     24.7%
Counties             744        947     27.3%
Cities             8,535     10,647     24.7%
Neighborhoods      4,190      5,176     23.5%
ZIP Codes         10,205     12,602     23.5%
Total             24,107     29,906     24.1%
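The Increase column can be reproduced from the two count columns. The short check below recomputes each percentage from the Table 4 figures; only the variable names are new.

```python
# Region counts taken from Table 4.
oct_counts = {"States": 44, "Metros": 389, "Counties": 744,
              "Cities": 8_535, "Neighborhoods": 4_190, "ZIP Codes": 10_205}
nov_counts = {"States": 49, "Metros": 485, "Counties": 947,
              "Cities": 10_647, "Neighborhoods": 5_176, "ZIP Codes": 12_602}

# Percentage increase per region type, rounded to one decimal place.
increase = {
    name: round(100 * (nov_counts[name] - oct_counts[name]) / oct_counts[name], 1)
    for name in oct_counts
}

total_oct = sum(oct_counts.values())
total_nov = sum(nov_counts.values())
total_increase = round(100 * (total_nov - total_oct) / total_oct, 1)

print(increase)
print(total_oct, total_nov, total_increase)
```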

The new valuation model and data filtering algorithms have led to a restatement of past values for the ZHVI. Figures 1 and 2 compare the ZHVI for the old versus new methodology for the US and for the Composite 20 metropolitan markets. The revision of the ZHVI has generally raised the overall estimated median value of homes. The national median home value is higher by 3.2%: $168,000 versus $162,800 with the previous methodology. The increase is due to better accuracy of the new valuation model and better screening of transactions that are normally excluded from the ZHVI (e.g., foreclosures and foreclosure re-sales).

Figure 1: Comparison of new and old ZHVI methodologies (U.S.), April 1996 to November 2013

Figure 2: Comparison of new and old ZHVI methodologies (Composite 20), April 1996 to November 2013

Qualitatively, the restated ZHVIs are similar to the ZHVIs calculated using the previous methodology, although they have less volatility, particularly on shorter time scales. The size and direction of the revision depend very much on the region, although the ZHVI year-over-year change is typically revised downward for areas that have experienced high price appreciation. For example, the November year-over-year change for Phoenix has been revised downward from 19.4% to 15.0%. In addition to revisions due to the more accurate valuation method and improved transaction filtering, the new approach to correcting residual systematic error also contributes somewhat to the restatement of the index levels.

November 2013 Methodology Revision Details

The methodology improvements released with the November 2013 data were based on three main areas. First, the ZHVI has been rebased on the latest valuation model that produces the Zestimates. Second, new and improved filtering algorithms were incorporated to screen out bad transactional data. Third, the approach to estimate the systematic error correction was updated. These are discussed below.

More Accurate Valuation Model: The November 2013 ZHVI release incorporated the latest version of the home valuation model, which is the model that produces the Zestimate. This latest version resulted in a significant increase in the accuracy of the Zestimate (which is 13% more accurate than a year ago). Accuracy was especially improved for high-end homes (30% improvement), waterfront homes and homes in less urban areas. The new valuation model led to moderate revisions to the national ZHVI: a small increase in the overall level of the ZHVI and somewhat damped peak-to-trough cycles.

Improved Transaction Filtering: The November 2013 ZHVI also took advantage of improved filtering on transactional data. This change impacts the ZHVI indirectly through a corresponding improvement in the valuation model (see above) and more directly through more accurate correction for residual systematic error. The valuation algorithm and the ZHVI exclude transactions that are not representative of what is considered a full-value, arm's-length transaction between a buyer and a seller. This definition excludes transactions such as foreclosures, foreclosure re-sales, estate sales and intra-family transfers. In doing a better job of identifying these transactions, the ZHVI has increased on a national basis as well as in many regions. For example, the current ZHVI for Sacramento is $300,000 under the revised methodology versus $284,500 under the previous methodology, a 5.4% increase in level.

Systematic Error Correction: The systematic error correction is based on comparing transactions with the corresponding Zestimates over a time period. Since transactions are relatively sparse, particularly in smaller geographic regions, the new systematic error correction method smooths the bias over time and shrinks the estimate towards zero. The smoothing procedure is based on fitting a natural cubic spline with knots evenly spaced every twelve months. Specifically, the smoothed value is given by the predicted value from the model.

Equation1

For regions with fewer than 100 transactions in a time period, the resulting smoothed estimate of bias will be shrunk towards zero.
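The exact shrinkage rule is not given here. Purely as an illustration of the idea, the sketch below assumes a linear rule in which the bias estimate is scaled by the fraction of the 100-transaction target that was actually observed. Both the rule and the name shrink_bias are hypothetical.

```python
def shrink_bias(bias, n_transactions, full_weight_at=100):
    """Shrink a bias estimate toward zero when transactions are sparse.

    Hypothetical linear rule: full weight at full_weight_at transactions or
    more, proportionally less below that. Not the documented formula.
    """
    weight = min(1.0, n_transactions / full_weight_at)
    return weight * bias


sparse = shrink_bias(0.04, n_transactions=25)   # heavily shrunk
ample = shrink_bias(0.04, n_transactions=150)   # left unchanged
empty = shrink_bias(0.04, n_transactions=0)     # no evidence, no correction

print(sparse, ample, empty)
```

Whatever the precise rule, the effect is the same in spirit: regions with little transaction evidence receive a correction closer to zero.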

New Coverage of ZHVI by County

New coverage of ZHVI (green) in addition to the old coverage (blue) is shown in the interactive map below:


References

Henderson, R. (1916). Note on Graduation by Adjusted Average. Transactions of the Actuarial Society of America, 17, 43-48.

Cleveland, R.B., Cleveland, W.S., McRae, J.E., & Terpenning, I. (1990). STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, 6, 3-73.

Dorsey, R.E., Hu, H., Mayer, W.J., & Wang, H. (2010). Hedonic versus repeat-sales housing price indexes for measuring the recent boom-bust cycle. Journal of Housing Economics, 19 (2), 75-93.


About the author
Andrew Bruce
Andrew is the Director of Data Sciences at Zillow


