Improving the home selling & buying experience by containerizing ML deployments

With Zillow Offers we’re transforming how real estate is bought and sold. Underpinning it is a process we follow that ensures every seller and buyer receives a delightful and consistent experience. Unsurprisingly, this process is ripe for automation, and our approach on the Zillow Offers Machine Learning team has been to develop a containerized platform for rapidly developing, validating, and deploying models in response to evolving business requirements.
After deploying our first two models using Zillow’s existing AWS EC2-backed serving architecture, we quickly realized the difficulties our serving tier would face. A given model might be called only once per offer request, while another might be called thousands of times per minute. Balancing cost, availability, and burst scalability soon led us to investigate an alternative approach.
Our initial software architecture utilized a monolithic design, which allowed us to iterate quickly without needing to provision new hardware. However, as new models were added to it, dependency management became a complex issue. The need to deploy each model separately, encapsulate its dependencies, and guarantee reproducible predictions drove us to adopt containers.
Once we identified these challenges, we wanted to approach the solution in a way that wouldn’t require costly re-tooling or re-training in the near future. A key step here was agreeing on nomenclature to describe the key concepts and components. We settled on projects, variants, and endpoints: each project can have multiple variants, with each variant supporting multiple endpoint interfaces for real-time scoring and debugging.
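To make the hierarchy concrete, here is a minimal sketch of how these concepts relate to one another; the class and field names are illustrative placeholders rather than our actual internal schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Endpoint:
    # An interface a variant exposes, e.g. real-time scoring or debugging.
    name: str  # e.g. "score" or "debug"
    path: str  # e.g. "/v1/score"

@dataclass
class Variant:
    # A specific modeling approach within a project.
    name: str
    endpoints: List[Endpoint] = field(default_factory=list)

@dataclass
class Project:
    # A research or business problem area that owns one or more variants.
    name: str
    variants: List[Variant] = field(default_factory=list)
```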
With these concepts defined, we needed to turn them into a repeatable pattern. We turned to cookiecutter, a utility for building templatized projects, to provide us with a design pattern for uniformly spawning new repositories. The benefits of this approach were the ease of generating new projects and the ability to customize them. However, rolling out updates to the pattern itself can be tricky and requires a great deal of attention. We simplified our process by devising a set of internal libraries and tooling shared across projects, centralizing code that performs repetitive heavy lifting and abstracting away much of the plumbing between components. This allowed model developers to focus on the tasks they’re best at!
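As a concrete illustration, a new project can be generated from a template like ours through cookiecutter’s Python API; the template URL and context keys below are hypothetical stand-ins for our internal archetype.

```python
from cookiecutter.main import cookiecutter

# Generate a new project repository from a (hypothetical) internal template.
# extra_context pre-fills the template prompts and no_input skips the interactive questions.
cookiecutter(
    "https://gitlab.example.com/ml-platform/ml-project-archetype.git",  # placeholder URL
    no_input=True,
    extra_context={
        "project_name": "offer-valuation",
        "variant_name": "baseline",
    },
)
```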
Two of these tools are Pillar and Scorcerer. Pillar encapsulates utilities to construct training workflows, handles our code variant structure, and ingests hyperparameter and training data configurations for jobs executed in AWS SageMaker. Scorcerer centralizes the tooling for exposing endpoints to our consuming clients. Scorcerer is infamous internally for being mispronounced, but it provides a magical experience and, with new services onboarding, is here to stay!
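Pillar’s actual interface is internal, but conceptually it ingests a configuration along the lines of the sketch below and turns it into a SageMaker training job; every name and path here is hypothetical.

```python
# Hypothetical variant configuration of the kind a library like Pillar might ingest.
training_config = {
    "project": "offer-valuation",
    "variant": "baseline",
    "hyperparameters": {
        "max_depth": 6,
        "learning_rate": 0.1,
        "num_round": 200,
    },
    "channels": {
        # Training data channels handed to SageMaker (see the pipeline description below).
        "train": "s3://example-bucket/offer-valuation/baseline/train/",
        "validation": "s3://example-bucket/offer-valuation/baseline/validation/",
    },
    "instance_type": "ml.m5.xlarge",
}
```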
When built, each project packages all of its dependencies into a single Docker image and exposes multiple entrypoints that are called during steps in the training pipeline. By reusing the same image between training and serving we flatten the required dependency set (at both the system and application level), which allows us to take a model we’ve trained within SageMaker, serialize it to our Data Lake, and deploy it on Kubernetes for real-time serving. Without Docker, debugging models trained in the past was fraught with errors, as dependency sets were often incompatible. Now, as we train, we version the serialized model and Docker image together to streamline this process.
Entrypoints for a variant Docker image that’s used by pipeline steps.
By default, each image exposes entrypoints for the training and serving steps of this pipeline.
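As a rough sketch of the idea (not our exact entrypoint names), a single image can route between training and serving behavior based on the argument it is invoked with:

```python
#!/usr/bin/env python
"""Illustrative entrypoint dispatcher baked into a variant Docker image."""
import sys

def train():
    # Invoked by SageMaker during the training pipeline step.
    print("running training...")

def serve():
    # Invoked when the same image is deployed for real-time scoring.
    print("starting scoring server...")

if __name__ == "__main__":
    command = sys.argv[1] if len(sys.argv) > 1 else "serve"
    {"train": train, "serve": serve}[command]()
```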
A new project is generated when a new research area is identified. The source code template is generated and customized using our cookiecutter-backed archetype pattern. Initial research is performed in notebooks stored within the project, and any reusable code is pulled out into separate modules. We use Gitlab’s built-in CI/CD tooling to construct the pipelines that build, test, and publish our Docker images to repositories on AWS ECR. This approach has drastically simplified our route from initial research to production by maximizing code reuse between these environments.
Airflow operates as our training pipeline orchestrator, initiating one or more Spark jobs that take raw datasets from our Data Lake and transform them into datasets tailored for training. We then feed these datasets into SageMaker as separate channels. At this point we typically complete any feature engineering required for the specific model variant.
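A stripped-down version of that orchestration might look like the following Airflow DAG; the operators, connection details, and S3 paths are illustrative rather than our production definitions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

def launch_training(**_):
    # Kick off a SageMaker training job, feeding the transformed datasets in as channels.
    import sagemaker
    from sagemaker.estimator import Estimator

    estimator = Estimator(
        image_uri="123456789012.dkr.ecr.us-west-2.amazonaws.com/offer-valuation:latest",  # placeholder
        role="arn:aws:iam::123456789012:role/sagemaker-training",  # placeholder
        instance_count=1,
        instance_type="ml.m5.xlarge",
        output_path="s3://example-bucket/offer-valuation/baseline/models/",
        sagemaker_session=sagemaker.Session(),
    )
    estimator.fit({
        "train": "s3://example-bucket/offer-valuation/baseline/train/",
        "validation": "s3://example-bucket/offer-valuation/baseline/validation/",
    })

with DAG(
    dag_id="offer_valuation_baseline_training",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    transform = SparkSubmitOperator(
        task_id="transform_raw_datasets",
        application="/opt/jobs/transform_raw_datasets.py",  # placeholder Spark job
    )
    train = PythonOperator(task_id="train_model", python_callable=launch_training)

    transform >> train
```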
Pillar then orchestrates training flows within the SageMaker train entrypoint and evaluates models as they finish training. The evaluation provides us with a publishing decision for downstream tasks to interpret. We then serialize the model and save it to S3 using a pathing structure that encodes the project, variant name, and other salient metadata.
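The exact layout is internal, but a hypothetical version of that pathing scheme might be assembled like this:

```python
from datetime import datetime, timezone

def model_artifact_path(bucket: str, project: str, variant: str, run_id: str) -> str:
    """Build an S3 path encoding the project, variant, and training run (illustrative layout)."""
    trained_at = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"s3://{bucket}/models/{project}/{variant}/{trained_at}-{run_id}/model.tar.gz"

# e.g. s3://example-bucket/models/offer-valuation/baseline/20190601T120000Z-abc123/model.tar.gz
print(model_artifact_path("example-bucket", "offer-valuation", "baseline", "abc123"))
```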
If the decision is made to publish, Airflow retrieves the variant S3 path generated in the previous step and passes it to Gitlab, where another deployment pipeline uses Helm to deploy to Kubernetes through our real-time scoring solution, Scorcerer. If we decide not to publish, we trigger an alert for manual review.
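One way to make that hand-off from Airflow is Gitlab’s pipeline trigger API, passing the model location through as a pipeline variable; the project ID, token, and variable name below are placeholders.

```python
import requests

def trigger_deployment(model_s3_path: str) -> None:
    # Trigger the (hypothetical) Gitlab deployment pipeline so the Helm step can
    # pick up the freshly published model artifact.
    response = requests.post(
        "https://gitlab.example.com/api/v4/projects/1234/trigger/pipeline",
        data={
            "token": "<trigger-token>",  # stored in a secret manager in practice
            "ref": "master",
            "variables[MODEL_S3_PATH]": model_s3_path,
        },
        timeout=30,
    )
    response.raise_for_status()
```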
We log metrics for scoring requests made to Scorcerer in Datadog for debugging, and in Zillow’s Data Streaming Platform for longer-term offline prediction modeling. This allows us to reconcile scoring requests against ground-truth datasets after the fact and ensure our performance correlates with actual values.
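As a simple illustration of the Datadog side (the metric names and tags are made up), a scoring path can emit request counts and latencies through DogStatsD:

```python
import time

from datadog import initialize, statsd

# Point DogStatsD at the local Datadog agent (these are the default host and port).
initialize(statsd_host="localhost", statsd_port=8125)

def score_with_metrics(score_fn, features, project: str, variant: str):
    # Wrap a scoring call so every request emits a count and a latency metric.
    tags = [f"project:{project}", f"variant:{variant}"]
    start = time.time()
    prediction = score_fn(features)
    statsd.increment("scoring.requests", tags=tags)
    statsd.histogram("scoring.latency_ms", (time.time() - start) * 1000.0, tags=tags)
    return prediction
```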
Artifacts built in Gitlab are consumed by daily training jobs orchestrated by Airflow.
As Zillow Offers continues to roll out to additional markets we’ve seen an increase in the complexity of the requirements we’re being asked to support. At the same time, we’re increasingly asked to ensure that any models we produce are interpretable for our colleagues internally, as well as for our customers. The next generation of our platform will take this into consideration and is likely to focus on pipelining technologies at both training and inference time. If this sounds like something you’d be interested in helping us out with, we’re hiring.
Many thanks to Steven Hoelscher, Alex Pryiomka, Ezra Schiff, Sruthi Nagalla & Taylor McKay for guidance and editorial help on this article and to the entire ZO Machine Learning team for their efforts getting us this far!