Automated Testing and the “Zon” Service

At Zillow, we like to release code as often as possible, because this lets us iterate quickly and get bug fixes and improvements to our users sooner.  Need a landing page because the president of the United States called and your CEO will be moderating a video chat with him in a few days?  You’d better be able to ship code quickly.  To support our fast releases, we need a testing system that can quickly gauge the quality of any given release.  This is easy with a small team, but becomes more difficult as a team grows and adds developers.  More devs means more changes merging together at an ever-increasing pace.  To ship safely, we need to be able to quickly determine if all this newly-integrated code is behaving in the way we expect it to.

Automated testing is certainly part of the answer, but we need to compile results from different testing frameworks into a single source that can quickly show us the quality of a piece of code.  Enter “Zon.” If you’ve read some other posts on the Zillow Engineering Blog, you’ll know that we like Z names.  The name “Zon” comes from Z-On.  It tells us when services are up or on. Ideally, all the time!

Zon contains a handful of predefined tests.  One of these does simple validation of an HTTP response.  This can be used in two ways.  First, to verify a page is returning the expected content in the response.  Second, for deeper tests, a service under test exposes an HTTP endpoint that triggers test code in the service itself; the response contains either a success message or details on the failed test.  We also have a predefined test that tracks statistics from a private WebPageTest instance and another that runs nose tests.  Being able to see results from all these sources together is really useful.
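
As a rough illustration of the second style, a service might expose a self-test endpoint along these lines.  This is a minimal sketch assuming a Flask-based service; the route, the check being performed, and the response shape are made up for the example and are not Zon’s actual contract.

# Hypothetical service-side self-test endpoint of the kind Zon could call.
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Placeholder for a real dependency check; returns (ok, detail).
    return True, "connected"

@app.route("/selftest/database")
def selftest_database():
    ok, detail = check_database()
    # Either a success message or details on the failed check, plus a
    # status code that test tooling could key off of.
    return jsonify({"passed": ok, "detail": detail}), (200 if ok else 500)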

Adding a test happens in two steps. First, the service under test must register with Zon. This happens via a hook on deployment. If a properly formatted Zon configuration file exists in the service’s deployment package, the service and host combination will be registered with Zon when the service is started after deployment. Second, the service must publish the list of tests it wants performed. A simple test to check if a page is available looks like this:

{
  "Homepage": {
    "type": "simple_http",
    "description": "Fetch the Zon homepage",
    "production_safe": true,
    "paths": [
      "/"
    ]
  }
}

This is a basic test that Zon defines for itself. It uses the “simple_http” predefined type. The “paths” key tells the test which paths of the webservice to check for healthy HTTP status codes. This test also uses “production_safe” to assert it can run in the production environment. Some tests permanently alter data or cause load that we don’t want in a production environment, so this option defaults to false. These self-tests for Zon serve the dual purpose of providing an example for developers to work from when creating new tests and verifying that Zon itself is working. If you’re adding a test to a service that already uses Zon, it’s as simple as adding a new test definition like the one above to the service’s list of tests.
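
To make the mechanics concrete, a simple_http-style check essentially fetches each listed path on a host and looks at the status code.  The sketch below is an illustration under assumptions of our own (treating anything below 400 as healthy, using the requests library), not Zon’s implementation.

# Sketch of a simple_http-style check against a single host.
import requests

def simple_http_check(base_url, paths, timeout=10):
    # Returns {path: (passed, status code or error string)} for one host.
    results = {}
    for path in paths:
        try:
            response = requests.get(base_url + path, timeout=timeout)
            results[path] = (response.status_code < 400, response.status_code)
        except requests.RequestException as exc:
            results[path] = (False, str(exc))
    return results

# e.g. simple_http_check("http://zon.example.com", ["/"])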

Zon includes tools to check the format of the test list and run any test without logging results. This can all be done on a development machine to give developers confidence their tests will behave as intended the next time their service is deployed.
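
A local sanity check of a test-definition file might look something like the following.  The required fields here are only inferred from the example above, and this is not Zon’s actual validator; it is a sketch of the idea.

# Rough local validation of a test-definition file before deployment.
import json
import sys

REQUIRED_KEYS = {"type", "description"}

def validate_test_list(path):
    # A JSON parse error here already tells the developer enough.
    with open(path) as handle:
        tests = json.load(handle)
    errors = []
    for name, definition in tests.items():
        missing = REQUIRED_KEYS - definition.keys()
        if missing:
            errors.append(f"{name}: missing keys {sorted(missing)}")
        if definition.get("type") == "simple_http" and not definition.get("paths"):
            errors.append(f"{name}: simple_http tests need a non-empty 'paths' list")
    return errors

if __name__ == "__main__":
    problems = validate_test_list(sys.argv[1])
    print("\n".join(problems) if problems else "test list looks well-formed")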

The graph above shows test failures for one feature for the previous five days.  Vertical red lines indicate deployments to this environment.  We expect deployments to cause some test failures.  The left side of this graph tells a story of a deployment that broke about half of the tests and a few subsequent deployments that corrected the issue.

For web services, Zon tests each host of a pool individually.  The failure pattern can provide insight into the nature of the failure.  Are all the hosts failing a particular test right after a deployment?  It’s probably a code-level issue.  Is only one host failing?  Time to check that host for a misconfiguration or hardware issue.  This helps separate the signal from the noise in our non-production environments where outages are more common.
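
That triage logic boils down to something like the sketch below, assuming per-host pass/fail results; the data shape and messages are hypothetical.

# Hypothetical classification of per-host results for one test.
def classify_failure(results_by_host):
    # results_by_host maps hostname -> True (passed) / False (failed).
    failing = [host for host, passed in results_by_host.items() if not passed]
    if not failing:
        return "healthy"
    if len(failing) == len(results_by_host):
        return "all hosts failing: likely a code-level issue in the release"
    return f"host-specific failures on {failing}: check config or hardware"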

One of our initial goals was to be able to verify a full deployment in 10 minutes, covering all the most important functionality.  We would run these tests continuously to have an up-to-date picture of overall quality across a diverse set of services.  More expensive and longer-running tests would not be run for single changes.  In practice, we exceeded that 10-minute mark, and a segment of our Selenium-based tests runs every 30 minutes to manage the load on our Selenium cluster.  We’re in the process of moving what we can to PhantomJS, which is less resource-intensive.

Zon keeps detailed test results in a fixed-size data store that is used like a ring buffer.  Older detailed test results are automatically overwritten as new results are created.  In addition, every test failure is logged to Graphite.  Graphite’s rich set of functions makes it easy to expose trends and display the data in a meaningful way. There were a few lessons learned along the way.  We ran into some performance degradation with Graphite when asking it to aggregate data from a large number of different metrics on the fly.  There are two suggested ways to address this:  Graphite can handle the aggregation for you via its configuration, or your code can accumulate combined statistics in separate aggregation metrics. We chose the second approach.  Unlike most of our other services, our Graphite configuration isn’t stored in Git.  We wanted to be able to iterate on the details of the aggregation, so it made more sense to change Zon’s code to also log to an aggregation metric when it logged a failure.
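
As a rough sketch of that second approach, the failure-logging path can emit both a detailed per-test metric and a pre-combined aggregation metric in one shot, using Graphite’s plaintext protocol (a “metric value timestamp” line sent over TCP, port 2003 by default).  The metric names and host below are placeholders, not our actual configuration.

# Sketch: log a failure to a detailed metric and an aggregation metric.
import socket
import time

GRAPHITE_ADDRESS = ("graphite.example.com", 2003)  # placeholder host, default plaintext port

def log_failure(service, test_name):
    now = int(time.time())
    lines = (
        # A detailed metric for this specific test...
        f"zon.{service}.tests.{test_name}.failures 1 {now}\n"
        # ...and a pre-combined metric, so dashboards never have to
        # aggregate a large number of individual series on the fly.
        f"zon.{service}.failures_total 1 {now}\n"
    )
    with socket.create_connection(GRAPHITE_ADDRESS) as conn:
        conn.sendall(lines.encode("ascii"))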

Our first pass of Zon included a view that showed a service’s health across all environments it was deployed to on a single graph.  We eventually removed this view because it just wasn’t as useful as we hoped.  These different environments represented different levels of code maturity and in some cases codebases from different sprints.  A spike of errors in that graph didn’t help a user pinpoint the issue.  Knowing “Something somewhere isn’t working, unless you just had an outage from deploying code” wasn’t all that helpful.  This led us to remove that view and also start showing deployment events on the graph in other views.

We’re happy with the data Zon is providing us now, and will be building on it in the future. One possibility is to create automation that can take hosts failing tests out of a production pool, with a few safeguards.  Zon will undoubtedly evolve as we continue to refine it and see what works and what doesn’t.  Overall it’s a welcome addition to our toolbox and helps us meet our goal of shipping quality software as fast as humanly possible.
