Zillow Tech Hub

How Zillow Validates Public Record Addresses

Ensuring correct home addresses at scale

Public record data, particularly assessment data, is the backbone of Zillow’s living database of homes. If you want an inventory of every property in the nation (residential, commercial, and vacant land), you would begin with the assessment data. It is the starting point for the innumerable insights Zillow provides on a home, including the Zestimate. We can associate a property with transactional data to determine how much it last sold for and when. We can link the home to our MLS feed data to know as soon as it goes up for sale. Thanks to assessment data, you’ll likely be able to find information on any home in your neighborhood, even if it hasn’t been sold in decades!

Because we maintain data on each of the 110 million homes in the United States, we go to great lengths to ensure that this data is accurate and up to date. Unfortunately, much of the source data is error-prone and inconsistent. In this blog post, we will review the significance of assessment data and go into detail on the system we built to prevent thousands of inaccurate records from contaminating our front-facing data.

Assessor Data

Each county in the US has an assessor’s office that conducts inspections on each property in its region. When the county sends an assessor to a home, they inspect the property and collect data on its features such as number of bedrooms, bathrooms and square footage. The data collected by the assessor is used to estimate both property value and tax implications.

This data also includes location fields such as address, city, and zip code. Since much of it is entered manually, it is prone to human error. Examples of frequent errors include:

  • Zip codes with missing digits
  • City names with inconsistent hyphenation
  • State abbreviation typos (e.g. a home located in “AK” could be mistyped as “AL”)

The example below shows a common instance where the city name “Easton” was typed incorrectly.

As a result of the county’s error, consumers searching for homes in one region could be viewing homes in a completely different region. If left unmonitored, this data could make it to Zillow.com and misdirect a consumer’s search to an incorrect listing. Giving our consumers inaccurate data puts us at risk of losing their trust in Zillow as an accurate source of real estate information.

Our Solution

To detect and fix faulty property data before it reaches Zillow.com, we developed the Assessment Validation System (AVS).

The AVS takes large sets of property assessment data as input, compares them against a national Geographic Information System (GIS) database, and flags any incorrect addresses so that our team of analysts can either fix or omit the data before it reaches our customers.

The AVS consists of the following components:

  1. A central GIS database containing regional data
  2. A daily Spark job that loads this data to…
  3. A DynamoDB database
  4. An API to read results from this data
  5. API consumers to flag incorrect data

Let’s break down each component of the system.

Source Data: GIS regional database

Our system validates each assessment record on the following address components: state, city, zip code, and FIPS (a unique county identifier code). We chose these fields because they showed high rates of mismatches.

To determine what counts as a “correct” address combination, we must start with a central source of truth. Zillow’s GIS team manages a comprehensive database of geospatial data acquired from sources such as the U.S. Census Bureau, USPS, and county GIS data. We use this data to find every valid pairing among city, state, FIPS, and zip code.

For example, for the city of Beverly Hills, we know that a valid city and state pair is (Beverly Hills, California), and a valid city and zip code pair is (Beverly Hills, 90210). We treat any address that does not match one of these pairs as an error to fix.
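The pair check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `VALID_PAIRS` set, the field names, the use of state abbreviations, and the `normalize` and `find_mismatches` helpers are all hypothetical and not taken from the AVS itself.

```python
from itertools import combinations

# Address components validated against each other (field names are assumed).
FIELDS = ("city", "state", "zip", "fips")

# Hypothetical reference pairs; a real system would derive these from GIS data.
VALID_PAIRS = {
    ("city", "beverly hills", "state", "ca"),
    ("city", "beverly hills", "zip", "90210"),
    ("state", "ca", "zip", "90210"),
}


def normalize(value):
    """Lowercase and collapse whitespace so lookups ignore casing and spacing."""
    return " ".join(str(value).strip().lower().split())


def find_mismatches(address, valid_pairs=VALID_PAIRS):
    """Return the (field_a, field_b) pairs whose values are not a known pair."""
    present = [f for f in FIELDS if address.get(f)]
    mismatches = []
    for a, b in combinations(present, 2):
        key = (a, normalize(address[a]), b, normalize(address[b]))
        if key not in valid_pairs:
            mismatches.append((a, b))
    return mismatches
```

With this toy reference set, an address of Beverly Hills, CA, 90210 produces no mismatches, while changing the state to NY flags both the city/state and the state/zip pairs.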

Loading GIS Data to DynamoDB

The central GIS database currently has more than 500,000 regions (and counting!). To avoid putting unnecessary load on the GIS team’s database, we grab a copy of the data, perform several transforms on it, and write these records to DynamoDB. We chose DynamoDB because we needed this data in a key-value data store that is extremely fast under read-heavy workloads.

The GIS data is regularly exported to Zillow’s data lake as parquet files in S3. We created a table schema in AWS Glue for these parquet files. This schema allowed us to use Athena for more accessible querying and analysis. We used this schema to write a Glue ETL job to load and transform this data for simplified lookup. Since the source data is updated regularly, we run this Glue job daily to keep our DynamoDB records up to date.

Sample region data

All regions associated with the city of Irvine, CA

Transformed data in DynamoDB

What we end up with is a list of every known combination of cities, zip codes, states, and counties: approximately 500,000 in total.
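As a rough illustration of the kind of transform involved, the sketch below expands region rows into one key-value item per pair of fields, in plain Python rather than as a Glue job. The row layout, the `pk`/`sk` key format, and the function names are assumptions made for this example, not the actual schema.

```python
from itertools import combinations

# Field names for a region row (assumed for illustration).
FIELDS = ("city", "state", "zip", "fips")


def region_to_pair_items(region):
    """Expand one region row into one item per pair of populated fields."""
    present = [(f, str(region[f]).strip().lower()) for f in FIELDS if region.get(f)]
    items = []
    for (field_a, value_a), (field_b, value_b) in combinations(present, 2):
        # Hypothetical key layout: partition key and sort key as "field#value".
        items.append({"pk": f"{field_a}#{value_a}", "sk": f"{field_b}#{value_b}"})
    return items


def build_items(regions):
    """Collect pair items across all rows, dropping duplicates."""
    seen = set()
    unique = []
    for region in regions:
        for item in region_to_pair_items(region):
            key = (item["pk"], item["sk"])
            if key not in seen:
                seen.add(key)
                unique.append(item)
    return unique


# Two illustrative rows for Irvine, CA (Orange County, FIPS 06059).
regions = [
    {"city": "Irvine", "state": "CA", "zip": "92602", "fips": "06059"},
    {"city": "Irvine", "state": "CA", "zip": "92603", "fips": "06059"},
]
items = build_items(regions)
```

Each row yields six pairs, but the two rows share their city/state, city/FIPS, and state/FIPS pairs, so deduplication leaves nine items to write.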

API to determine address validity

Once this data is loaded to our database, we use an API to read from it and return meaningful results. We wrote a small Python app that runs in AWS Lambda behind an API Gateway as a simple yet highly scalable solution. Thanks to the performance of these AWS resources, we can return thousands of address validation results in under a second.

Sample error indicating that Beverly Hills is not in the state of New York
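A minimal sketch of what such a handler could look like is shown here, assuming an API Gateway proxy-style event whose body is a JSON list of records. The request and response shapes, the `pair_exists` helper, and the in-memory `KNOWN_PAIRS` set (standing in for the DynamoDB lookup) are hypothetical.

```python
import json

# In-memory stand-in for the DynamoDB table; a real handler would query it.
KNOWN_PAIRS = {("city#beverly hills", "state#ca")}


def pair_exists(pk, sk):
    """Hypothetical lookup: True if the (pk, sk) pair is a known valid pair."""
    return (pk, sk) in KNOWN_PAIRS


def handler(event, context):
    """Validate the city/state pair of each record in the request body."""
    records = json.loads(event.get("body") or "[]")
    results = []
    for rec in records:
        city = str(rec.get("city") or "")
        state = str(rec.get("state") or "")
        valid = pair_exists(f"city#{city.strip().lower()}",
                            f"state#{state.strip().lower()}")
        result = {"id": rec.get("id"), "valid": valid}
        if not valid:
            result["error"] = f"{city} is not in the state of {state}"
        results.append(result)
    return {"statusCode": 200, "body": json.dumps(results)}
```

Only the city/state pair is checked here to keep the example short; the same lookup would apply to the other field pairs.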

Consumers: Harmony, Juxtapose reports

Our existing data pipeline applications (Harmony and Juxtapose) call our API to validate large batches of assessment records. We flag any properties that have questionable data (anything that doesn’t match our GIS dataset), and our team of analysts makes corrections as needed. From here, the data continues through downstream pipelines and eventually makes it to Zillow.com.
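On the consumer side, the batching and flagging step might look roughly like the sketch below. The `validate_batch` callable stands in for a call to the validation API, and the batch size, result shape, and helper names are assumptions made for illustration.

```python
def chunked(records, size):
    """Yield consecutive slices of at most `size` records."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def flag_records(records, validate_batch, batch_size=500):
    """Return the records marked invalid, each annotated with a reason.

    `validate_batch` is a hypothetical callable that takes a list of records
    and returns one result dict per record, in the same order.
    """
    flagged = []
    for batch in chunked(records, batch_size):
        results = validate_batch(batch)
        for record, result in zip(batch, results):
            if not result["valid"]:
                flagged.append({**record, "reason": result.get("error", "unknown")})
    return flagged
```

Passing the validator in as a parameter keeps the flagging logic independent of how the API is called, which also makes it easy to test with a stub.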

Impact

In a sample over the last three months, the AVS withstood substantial load and identified numerous mismatches.

  • Validated over 11,000,000 parcels from 800 different counties
  • Found over 4 million record errors

Note: parcels can contain more than one mismatch.

We can only fix problems we can identify, which is what makes the AVS so powerful. It enables us to proactively catch incorrect data and fix it before it ever reaches our customers. Our mission is to supply our customers with the highest quality data to better inform their decisions around where they call home.

Credits

Thomas Baker: Principal Software Engineer

Preethi Manoharan: Senior Product Manager

Patrick Montgomery: Senior Group Manager

Neha Thapak: Software Development Manager