Zillow Tech Hub

Solving the Challenges of Public Records Data

Zillow has a living database of more than 110 million U.S. homes, including homes for sale, homes for rent and homes not currently on the market. The database is built from a range of disparate sources, incorporating streams of county records, tax data, listings of homes for sale, listings of rental properties and mortgage information. Zillow offers advanced statistical predictive products, including the Zestimate®, the Rent Zestimate and the ZHVI® family of real estate indexes. The Zestimate is an estimate of the value of over 100 million homes and is updated every day. Data Quality is a huge factor in creating accurate estimates. In this article we will discuss the data quality challenges with public records.

Data Completeness

Data completeness is primarily an issue at the data originator (the county). Someone must collect this data individually from each county, and the collection method varies: some counties provide the data online, some offer online access to images of deed documents that must be keyed in by a person, and some deliver documents in person. Data might be missing because a county did not collect all the required data or because errors were made during the key-in process.

In some counties, transactions are not available due to non-disclosure. Non-disclosure states do not require transactions to be recorded, and the transaction data cannot be distributed to the public.

We use the following metrics to understand data completeness.

Attribute Completeness

The data sciences team identified the fields that are crucial to the Zestimation process. For each field, fill rates are calculated by home type and county. For example, lot size is not valid for condos, so we ignore those values in the calculation.

FillRate = # of properties with the field available / Total # of properties
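As a minimal sketch of the fill rate calculation, assuming each property is a dictionary with hypothetical `county`, `home_type`, and attribute keys, the rate can be computed per county and home type while skipping fields that do not apply. The `NOT_APPLICABLE` table and the field names are illustrative assumptions, not the actual schema.

```python
from collections import defaultdict

# Hypothetical rule table: (home_type, field) pairs that are not valid,
# such as lot size for condos, are left out of the calculation.
NOT_APPLICABLE = {("condo", "lot_size")}


def fill_rates(properties, field):
    """Return {(county, home_type): fill rate} for a single field."""
    filled = defaultdict(int)
    total = defaultdict(int)
    for prop in properties:
        home_type = prop["home_type"]
        if (home_type, field) in NOT_APPLICABLE:
            continue  # field is not valid for this home type
        key = (prop["county"], home_type)
        total[key] += 1
        if prop.get(field) is not None:
            filled[key] += 1
    return {key: filled[key] / total[key] for key in total}


# Made-up sample rows, for illustration only.
properties = [
    {"county": "King", "home_type": "single_family", "lot_size": 5000},
    {"county": "King", "home_type": "single_family", "lot_size": None},
    {"county": "King", "home_type": "condo", "lot_size": None},
]
rates = fill_rates(properties, "lot_size")
```

Here the condo row is excluded entirely, so it neither raises nor lowers the lot size fill rate for its county.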

Transaction Completeness

We calculate the monthly transaction count for each county. Using market conditions, the property count, and the transaction count trend over the last five years, we determine the transaction completeness.
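The article does not spell out how the trend is turned into a completeness figure, so the following is only one possible sketch. It counts transactions per county and month, then compares a given month against the median of the same calendar month over the previous five years. The seasonal-median baseline, the function names, and the record keys are assumptions made for illustration, and market conditions and property counts are not modeled here.

```python
from collections import Counter
from datetime import date
from statistics import median


def monthly_counts(transactions):
    """Count transactions per (county, year, month) of the recorded date."""
    return Counter(
        (t["county"], t["recorded"].year, t["recorded"].month)
        for t in transactions
    )


def completeness_ratio(counts, county, year, month, lookback_years=5):
    """Ratio of one month's count to the median of the same month in prior years.

    Assumed baseline: the same calendar month in each of the previous
    `lookback_years` years. Returns None when there is no usable history.
    """
    history = [
        counts.get((county, y, month), 0)
        for y in range(year - lookback_years, year)
    ]
    baseline = median(history)
    if baseline == 0:
        return None
    return counts.get((county, year, month), 0) / baseline


# Made-up sample rows, for illustration only.
sample = [
    {"county": "King", "recorded": date(2020, 6, 3)},
    {"county": "King", "recorded": date(2020, 6, 17)},
    {"county": "King", "recorded": date(2020, 7, 1)},
]
counts = monthly_counts(sample)
```

A ratio well below 1 for a recent month would suggest that transactions are still missing, although a genuine market slowdown could produce the same signal.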

Data Latency

Yearly tax assessments and property transactions are important data for web users and also an input to our valuation models. Getting the data in a timely manner poses a huge challenge. Latency can be introduced at the county, at the data provider that collects the data from the county, or at the data ingestion team. Latency introduced at the data provider or during data ingestion can be addressed easily. Latency introduced because counties take time to collect and report data is a huge challenge.

We use the following metrics to understand data latency.

Assessment Latency

Assessment latency is the difference between the current year and the most recent assessment year received from the county.
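A minimal sketch of this metric, assuming the assessment years received from a county are available as a list of integers (the function name and the input shape are hypothetical):

```python
from datetime import date


def assessment_latency(assessment_years, current_year=None):
    """Years between the current year and the latest assessment year received."""
    if current_year is None:
        current_year = date.today().year
    return current_year - max(assessment_years)


# Made-up example: the latest assessment received is for 2018.
latency = assessment_latency([2016, 2017, 2018], current_year=2020)
```

Passing `current_year` explicitly keeps the calculation reproducible when it is rerun on historical data.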

Transaction Latency

Transaction latency is defined as the difference between the date a transaction was recorded and the date it was received in our database. For each county, we calculate the median latency.

TransactionLatency = Median (Transaction Received Date - Transaction Recorded Date)
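The per-county median can be sketched as follows, assuming each transaction carries hypothetical `county`, `recorded`, and `received` fields, with the two dates stored as `datetime.date` values. Latency is taken as received minus recorded, so that a later arrival gives a positive number of days.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def median_latency_days(transactions):
    """Return {county: median days between recorded and received dates}."""
    by_county = defaultdict(list)
    for t in transactions:
        days = (t["received"] - t["recorded"]).days
        by_county[t["county"]].append(days)
    return {county: median(days) for county, days in by_county.items()}


# Made-up sample rows, for illustration only.
transactions = [
    {"county": "King", "recorded": date(2020, 1, 1), "received": date(2020, 1, 11)},
    {"county": "King", "recorded": date(2020, 1, 5), "received": date(2020, 1, 25)},
    {"county": "King", "recorded": date(2020, 1, 1), "received": date(2020, 3, 31)},
]
latencies = median_latency_days(transactions)
```

The median is less sensitive than the mean to a few very late records, such as the third row above.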

Data Interpretation

Data is not standardized across counties because they seldom use the same system to track their property records. Data providers attempt to standardize the data into one format. During this process, some of the data might be lost or standardized incorrectly. We have seen cases where a data provider incorrectly interpreted SQFT to include the unfinished basement, which created huge problems downstream.

About the author

The Zillow Editorial Team is dedicated to empowering home buyers, sellers, and renters with expert insights, data-driven research, and practical advice. Our team of housing market analysts, real estate professionals, and writers brings you the latest trends, tips, and tools to help you navigate every step of your real estate journey. Whether you're exploring neighborhoods, estimating home values, or planning your next move, we provide the knowledge you need to make informed decisions with confidence.