
Airflow at Zillow: Easily Authoring and Managing ETL Pipelines

As one of the essential teams behind the millions of web and mobile requests for real-estate information Zillow serves, the Data Science and Engineering (DSE) team collects, processes, analyzes and delivers huge amounts of data every day. It is not only the sheer data volume but also the continually evolving business needs that make ETL jobs so challenging. The team had long been looking for a platform that would make authoring and managing ETL pipelines much easier, and in 2016 we found Airflow.

What is and Why Airflow?

Airflow, an Apache project open-sourced by Airbnb, is a platform to author, schedule and monitor workflows and data pipelines. Airflow jobs are described as directed acyclic graphs (DAGs), which define pipelines by specifying: what tasks to run, what dependencies they have, each job's priority, how often it runs, when it starts and stops, what to do on failures and retries, and so on. Typically, Airflow runs in a distributed setting, as shown in the diagram below. The Airflow scheduler schedules jobs according to the schedules and dependencies defined in the DAGs, and the Airflow workers pick up and run jobs with their load properly balanced. All job information is stored in the meta DB, which is kept up to date in a timely manner. Users can monitor their jobs via the Airflow web UI as well as the logs.
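To make this concrete, here is a minimal sketch of what such a DAG definition might look like. The DAG name, tasks, schedule and retry settings are hypothetical placeholders, and the imports assume the Airflow 1.x style that was current at the time; it is an illustration of the pattern, not one of our actual pipelines.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Failure/retry policy and start date shared by every task in this DAG.
default_args = {
    "owner": "dse",
    "retries": 2,
    "retry_delay": timedelta(minutes=10),
    "start_date": datetime(2017, 1, 1),
}

# The DAG object carries the schedule; the operators carry the actual work.
dag = DAG(
    dag_id="example_etl",
    default_args=default_args,
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Dependencies: extract must finish before transform, which must finish before load.
extract >> transform >> load
```

Everything the scheduler needs, including the schedule, the task graph and the failure-handling policy, lives in this one Python file, which is what makes authoring and reviewing pipelines so lightweight.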

Why did we choose Airflow? Among the top reasons, Airflow provides:

Airflow is Even More Fantastic at Zillow

Airflow makes authoring and running ETL jobs very easy, but we also want to automate the development lifecycle and Airflow backend management. The goals are:

The challenging goals above are achieved by combining Airflow with several of the most popular and cutting-edge technologies, including but not limited to Amazon Web Services (AWS), Docker and Splunk. Below is a diagram that shows how an Airflow cluster works on Zillow's DSE team, followed by an explanation.
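As one small illustration of how Airflow and AWS can fit together, a task might stage a daily extract to S3 via boto3. This is a hedged sketch rather than our actual setup; the bucket name, file path and DAG are hypothetical, and the imports again assume Airflow 1.x.

```python
from datetime import datetime

import boto3
from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def upload_extract(ds, **kwargs):
    """Upload a locally staged extract to S3, partitioned by execution date."""
    s3 = boto3.client("s3")  # credentials resolved from the instance role / environment
    s3.upload_file(
        Filename="/tmp/daily_extract.csv",   # hypothetical local staging path
        Bucket="example-dse-bucket",         # hypothetical bucket name
        Key="extracts/{}/daily_extract.csv".format(ds),
    )


dag = DAG(
    dag_id="example_s3_load",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
)

upload = PythonOperator(
    task_id="upload_extract",
    python_callable=upload_extract,
    provide_context=True,  # passes the execution-date context (Airflow 1.x)
    dag=dag,
)
```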

To smooth the entire development lifecycle, we created three independent Airflow environments, which are essentially three separate Airflow clusters:

The Current and Future of Airflow at Zillow

Since we created our first data pipeline with Airflow in late 2016, we have been very active in leveraging the platform to author and manage ETL jobs. People love it: compared with creating and managing self-scheduled standalone services, it is so much easier to simply run their jobs on this platform. As of this writing, Airflow at Zillow is serving around 30 ETL pipelines across the team, processing a large volume and a wide variety of data every day.

In light of the rapid growth in usage over the past few months, we expect Airflow to take on more and more ETL jobs, and potentially to expand to more teams that work with data every day. We are also working on sharing Airflow job information with other systems, as part of the effort to build a seamlessly connected ecosystem for Data Science and Engineering.
