Image Processing at Zillow: Out with the Old and In with the AWS

The old saying goes that real estate is about Location, Location, Location. We’d like to submit a new saying: online real estate shopping is about Photos, Photos, Photos. Nothing delights Zillow’s users more than big, beautiful, and bountiful photos on real estate listings. And nothing delights agents, sellers, and property managers more than the page views, leads, and traffic they get when their listings entice shoppers with great pictures!
As Zillow’s popularity increased and agents started providing higher resolution photos on their listings, we found that our legacy imaging system could not keep up with the demand. We realized we needed to scale in a big way. Amazon Web Services (AWS) was the way to get us there.
This blog post walks you through our journey from the legacy imaging system in our own data center to the AWS-powered imaging system we run today. We'll show you the scale of the problem, where the legacy system fell short, how the new AWS-based pipeline is put together, and the lessons we learned along the way.
We also discuss a few options we’re considering for making the new system even better, so please feel free to comment with your opinions and experiences!
Zillow currently processes over 3 million new images each day. This includes listing images; profile images for users, agents, and other professionals; home project images through Zillow Digs; and other assorted image types. Some are manually uploaded and some arrive in bulk feeds to be downloaded. The rate of new images coming from these bulk feeds can be unpredictable; they typically arrive in batches throughout the day rather than at an even rate.
We have a lot of turnover in our images, so we don't end up persisting 1+ billion images (3 million * 365) each year. However, the chart below shows how many more images we have come to store as the years have passed. You can see why we needed to scale up our system!
Our legacy system lived in our own hosted data centers (with the exception of a CDN for image serving). Images were downloaded out of one queue; stored on an NAS device in the “pyramidal TIFF” format; and served to the CDN out of a local Squid service.
Its primary problems were a single download queue that suffered from head-of-line blocking, expensive on-the-fly image scaling that leaned heavily on caching, and an aging processing tool that was hard to extend.
Here’s a diagram of our downloading and processing pipeline:
As you can see, manually uploaded images came in one path (on the right), while images from listing feeds entered another path with a queue (on the left). Because images from hundreds of feed sources were downloaded out of a single queue, we could not tailor our download rate or concurrency to each source. This caused head-of-line blocking (a slow or problematic image at the front of the queue held up everything behind it), risked overwhelming slower sources with too many requests, and left faster sources underutilized.
Images were stored on a large NAS device in the “pyramidal TIFF” format, which is essentially a multi-layer file made up of progressively smaller JPEGs. The idea is that you can produce any image size efficiently by opening the file, seeking to the layer holding the next-largest size, and resampling it down; this avoids the expense of scaling the original image (which may be much larger). Our scaling was done on the fly when a serving request arrived for a particular image size. Scaling on the fly can be expensive, so we relied heavily on caching via our CDN and local Squid service. If the cache was even partially blown, recovery would overwhelm the dynamic scaling layer and the NAS device for hours.
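To make the “next-largest layer” idea concrete, here is a minimal sketch using Pillow (the maintained fork of the Python Imaging Library). It is purely illustrative, not our legacy tool: the file name, target width, and resampling choices are assumptions.

```python
from PIL import Image

def render_from_pyramid(path, target_width):
    """Serve an arbitrary size from a pyramidal TIFF: seek to the smallest
    layer that is still at least target_width wide, then downsample it,
    instead of scaling the (much larger) original."""
    img = Image.open(path)                      # multi-page (pyramidal) TIFF
    best_frame, best_width = 0, None            # frame 0 is the full-resolution layer
    for frame in range(getattr(img, "n_frames", 1)):
        img.seek(frame)
        if img.width >= target_width and (best_width is None or img.width < best_width):
            best_frame, best_width = frame, img.width
    img.seek(best_frame)
    new_height = max(1, round(img.height * target_width / img.width))
    return img.resize((target_width, new_height))

# Hypothetical usage: render a thumbnail for a search-result card
# thumb = render_from_pyramid("listing_12345.ptif", 384)
```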
The tool we used to convert images into the pyramidal TIFF format was aging and not easily extensible. We could not easily add image quality improvements such as removing solid-color borders or supporting CMYK and other colorspaces.
Putting images “in the cloud” may sound like a simple endeavor, but we actually had to build a substantial amount of infrastructure to end up with a system that solved the problems of our legacy imaging system.
Here's a summary of how we solved the problems we set out to fix: per-feed download queues eliminate head-of-line blocking; pre-generating every size we need and serving it from S3 through CloudFront eliminates on-the-fly scaling and fragile caching; and the processing code is now Python that we can readily extend with new image quality improvements.
Here’s a diagram of the system that we built. The individual services are described below.
Download Server (DLS): This is a service in Zillow’s data center that manages image download requests coming from listing feeds. This includes ensuring that our database of images (which also lives in our data center) is kept up-to-date with the results of the cloud image processing.
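As a rough sketch (not our production code), the bookkeeping side of the DLS could look something like the following with boto3. The queue name and the mark_image callback are hypothetical stand-ins for our real result queues and database layer.

```python
import boto3

sqs = boto3.resource("sqs")

def sync_results(queue_name, mark_image):
    """Drain one result queue (success or failure) and apply each message
    to the image database via the caller-supplied mark_image callback."""
    queue = sqs.get_queue_by_name(QueueName=queue_name)
    while True:
        messages = queue.receive_messages(MaxNumberOfMessages=10, WaitTimeSeconds=20)
        if not messages:
            break
        for msg in messages:
            mark_image(msg.body)    # e.g. update the row for this image in the datacenter DB
            msg.delete()            # only delete once the DB write has succeeded
```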
Beanstalk REST API: This service is just a front-end in the cloud for the DLS. It takes an image download request and puts it into one of the per-feed SQS queues; we have one SQS queue per feed to solve the head-of-line blocking problem.
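A minimal sketch of that routing with boto3 might look like this; the queue naming scheme and message fields are assumptions for illustration.

```python
import json
import boto3

sqs = boto3.client("sqs")

def enqueue_download(feed_id, image_url, listing_id):
    """Route a download request to that feed's own queue so one slow feed
    can't block the others. Queue naming scheme is hypothetical."""
    queue_url = sqs.get_queue_url(QueueName=f"image-downloads-feed-{feed_id}")["QueueUrl"]
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps({"url": image_url, "listing_id": listing_id}),
    )
```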
Throttled Downloader: The throttled downloader does the actual downloading of the image. It controls the rate and concurrency at which we download images for each feed source, allowing us to take advantage of providers that support fast downloading and to not deluge providers that don’t. If the image download succeeds, we write the original image into S3 to be used in the image processing step. If the image download fails, a message is queued into the failure queue to be processed by the DLS.
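Here's a simplified sketch of per-feed throttling. The concurrency cap, pacing interval, bucket name, and key scheme are hypothetical, and the failure-queue path is omitted for brevity.

```python
import time
import threading
import urllib.request
import boto3

s3 = boto3.client("s3")

class ThrottledDownloader:
    """One instance per feed: cap concurrent requests and pace request starts
    so fast providers are used fully and slow providers aren't overwhelmed."""

    def __init__(self, max_concurrency=4, min_interval=0.25):
        self._slots = threading.Semaphore(max_concurrency)
        self._min_interval = min_interval          # seconds between request starts
        self._lock = threading.Lock()
        self._last_start = 0.0

    def fetch(self, image_url, s3_key, bucket="original-images-example"):
        with self._slots:                          # limit concurrency for this feed
            with self._lock:                       # simple pacing between request starts
                wait = self._last_start + self._min_interval - time.monotonic()
                if wait > 0:
                    time.sleep(wait)
                self._last_start = time.monotonic()
            data = urllib.request.urlopen(image_url, timeout=30).read()
        s3.put_object(Bucket=bucket, Key=s3_key, Body=data)   # original goes to S3 for processing
```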
Image Processor: This is a Beanstalk worker role deployment. It runs the Python Imaging Library with some custom code. We take the original images stored in S3 and process them through various image quality methods (such as removing solid color borders). We also generate a standard set of sizes for each image that we know we need on the site (for example: a large one for our full page photo gallery, a smaller one for our emails, etc.) and store those to S3. If the image conversion succeeds, a message is put into the success SQS queue; otherwise it goes into the failure queue.
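As an illustration of the kind of processing involved, here is a condensed Pillow sketch that trims a solid-color border and writes a few standard sizes back to S3. The size set, output key scheme, and JPEG quality are assumptions, and the success/failure queue messaging is left out.

```python
import io
import boto3
from PIL import Image, ImageChops

s3 = boto3.client("s3")
SIZES = {"gallery": 1536, "search": 576, "email": 192}   # hypothetical size set (pixels wide)

def trim_border(img):
    """Crop away a solid-color border by diffing against the corner pixel color."""
    background = Image.new(img.mode, img.size, img.getpixel((0, 0)))
    bbox = ImageChops.difference(img, background).getbbox()
    return img.crop(bbox) if bbox else img

def process(bucket, original_key):
    """Read the original from S3, clean it up, and write one JPEG per standard size."""
    original = s3.get_object(Bucket=bucket, Key=original_key)["Body"].read()
    img = trim_border(Image.open(io.BytesIO(original)).convert("RGB"))   # also normalizes CMYK, etc.
    for name, width in SIZES.items():
        height = max(1, round(img.height * width / img.width))
        out = io.BytesIO()
        img.resize((width, height)).save(out, format="JPEG", quality=85)
        s3.put_object(Bucket=bucket, Key=f"{original_key}_{name}.jpg",
                      Body=out.getvalue(), ContentType="image/jpeg")
```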
CDN: Images are served out of S3 and cached in the CloudFront CDN. We serve up to 15,000 images per second and have never had performance issues with this model.
In the process of moving over to a new imaging system and AWS, we learned some important lessons.
Measure! The initial versions of our apps did not expose enough metrics, so we could not figure out which component of the pipeline was contributing to image processing delays. We ended up investing a lot of time adding CloudWatch metrics to track latencies after the system was already live. Once we did, we had a clearer picture of the causes and could take action to eliminate them. In some cases, we could use metric thresholds to drive autoscaling activities, so no human intervention was required.
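Publishing a custom latency metric is a one-liner with boto3; the namespace, metric name, and dimensions below are placeholders rather than our actual metric names.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_latency(stage, seconds):
    """Publish a per-stage latency data point to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace="ImagePipeline",
        MetricData=[{
            "MetricName": "StageLatency",
            "Dimensions": [{"Name": "Stage", "Value": stage}],
            "Value": seconds,
            "Unit": "Seconds",
        }],
    )

# e.g. record_latency("download", elapsed) after each image download completes
```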
Use CloudWatch Alerts. We found cases where Beanstalk containers did not behave as expected. For example, in a worker role, sometimes the system would scale down, but leave a bad instance (one that failed early in the initialization phase) in service. The bad instance would never pick up any work. And since we used CPU thresholds to add/remove machines, the autoscaling group would never do its job because the CPU of the bad instance was always at zero. We could detect this by adding alerts on queue levels. We also added alerts to other parts of the pipeline which allowed us to closely monitor the SLA of the various components.
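For example, a backlog alarm on an SQS queue can be created like this; the alarm name, threshold, and SNS topic ARN are placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def alarm_on_backlog(queue_name, threshold):
    """Alert when a queue backs up (e.g. because a dead instance isn't pulling work)."""
    cloudwatch.put_metric_alarm(
        AlarmName=f"{queue_name}-backlog",
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": queue_name}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:imaging-oncall"],  # placeholder topic
    )
```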
We are continuously tuning the new system for the best performance, lowest cost, and broadest functionality, and there are a few more areas we want to tackle in the future.
Sometimes an old system is worth maintaining, but other times you’re better off throwing it away and starting from scratch. Given the importance of images to the Zillow user experience, we wanted a system that could grow to match our needs. By moving to AWS (S3, CloudFront, Elastic Beanstalk, and SQS) we no longer have to worry about forklift upgrades, cache flushes, or capacity. The move was not without its challenges (in large part because we had very little prior AWS experience), but we are happy with the new system and eager to continue improving it.