Last month, a few members of the Data Science and Engineering team had the opportunity to share how we use Spark at Zillow with the Seattle Spark Meetup group. At Zillow, our journey with Spark began a little over a year ago (with the nascent 1.2.1 release). In the time since, we have not only watched the cluster computing framework grow in stability and features, but also witnessed our own development as Spark users. This presentation is the culmination of that year of unearthing Spark’s intricacies, as well as the general do’s and don’ts of everyday use.
Briefly, I will summarize the major talking points from our presentation, which covered three topics: the data lake, user segmentation, and the Zestimate®.
Data Lake
I kicked off the presentation by walking through the company-wide data lake initiative that our team drove. Importantly, adopting Spark meant more than simply switching to a new parallelization framework: We wanted raw data readily available in the cloud to facilitate rapid prototyping of new machine learning applications. I then introduced the first component of such an application: the Historical Data Storage, a Spark job responsible for maintaining a history of property and user data to be consumed by the Zestimate. This job dedupes records, partitions the data for downstream services, and standardizes the output data format. The challenges I faced when writing this application revolved around ingesting and storing data.
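To make the shape of that job concrete, here is a minimal sketch of the dedupe-partition-standardize pattern, written against the modern PySpark DataFrame API. The paths, column names, and Parquet output format are illustrative assumptions, not details of our actual pipeline.

```python
# Minimal sketch of a dedupe / partition / standardize job (hypothetical schema and paths).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("historical-data-storage").getOrCreate()

# Ingest the latest raw property records (placeholder location and format).
raw = spark.read.json("s3://data-lake/raw/properties/")

# Keep only the most recent record for each property id.
latest = Window.partitionBy("property_id").orderBy(F.col("updated_at").desc())
deduped = (raw
           .withColumn("rank", F.row_number().over(latest))
           .filter(F.col("rank") == 1)
           .drop("rank"))

# Standardize the output format and partition it for downstream consumers.
(deduped
 .write
 .mode("overwrite")
 .partitionBy("state", "snapshot_date")
 .parquet("s3://data-lake/historical/properties/"))
```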
User Segmentation
Next up, Alex Chang discussed how we use Spark to tackle user segmentation at Zillow. As an organization, we always strive to delight our users by improving their home-finding experience. With user segmentation, we can accurately predict which persona typifies a user, allowing us to deliver an individualized experience. Of course, models of this kind require rich data that is often massive in size, which makes them perfect candidates for Spark. Alex shared the lessons he learned building this application: tuning the broadcast join threshold, understanding the difference between narrow and wide transformations, and caching mindfully.
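As an illustration of those three lessons, the snippet below raises the broadcast join threshold, uses an explicit broadcast hint so the large side of the join avoids a shuffle, and caches the joined result before it is reused by several wide (shuffle-inducing) aggregations. The table and column names are hypothetical, not our production schema.

```python
# Illustrative tuning sketch for a segmentation-style workload (hypothetical data).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("user-segmentation").getOrCreate()

# Raise the broadcast-join threshold so the small persona lookup table is
# shipped to every executor instead of triggering a shuffle join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

events = spark.read.parquet("s3://data-lake/clickstream/")       # large table
personas = spark.read.parquet("s3://data-lake/persona_lookup/")  # small table

# Explicit broadcast hint: the large `events` side is never shuffled for the join.
joined = events.join(F.broadcast(personas), on="user_id")

# Cache before reuse; each wide transformation below (groupBy triggers a shuffle)
# would otherwise recompute the join from scratch.
joined.cache()

by_segment = joined.groupBy("persona").count()
by_region = joined.groupBy("persona", "region").agg(F.avg("session_length"))
```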
Zestimate
Finally, David Fagnan compared how we created Zestimates before Spark with how we create them today. In both cases, the partitioning strategy remains the same: the machine learning models powering the algorithm require data to be partitioned by region. The big change between iterations was the programming language. Before Spark, we wrote everything in R, including a homebrew parallelization framework powering the Zestimate. When work first began on the new iteration, we deemed SparkR to be in its infancy and settled on PySpark instead. Still, we did not want to throw away our entire R codebase, so we opted to use the rpy2 Python library as the interface layer between our PySpark code and the machine learning models written in R.
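A rough sketch of how such an interface layer can look is below: the data is repartitioned by region, and each partition is handed to an R scoring function through rpy2. The model path, function name, and feature columns are placeholders for illustration, not the actual Zestimate code.

```python
# Hypothetical PySpark + rpy2 bridge: score each region's partition with an R model.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zestimate-scoring").getOrCreate()

def score_partition(rows):
    # Import rpy2 inside the function so the import happens on the executors.
    import rpy2.robjects as robjects

    # Load the existing R model code (placeholder path and function name).
    robjects.r('source("/opt/models/zestimate_model.R")')
    predict_r = robjects.globalenv["predict_home_value"]

    for row in rows:
        # Call into R one record at a time for clarity; converting a whole
        # partition into an R data.frame would amortize the call overhead.
        estimate = predict_r(float(row["sqft"]), float(row["beds"]),
                             float(row["baths"]))[0]
        yield (row["property_id"], float(estimate))

homes = spark.read.parquet("s3://data-lake/historical/properties/")

# Partition by region so each R model invocation sees only one region's homes.
estimates = (homes
             .repartition("region")
             .rdd
             .mapPartitions(score_partition)
             .toDF(["property_id", "zestimate"]))
```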
You can click through the presentation we delivered below.