TL;DR: Sometimes, writing an intermediate result to disk and then reading it back again is faster than checkpointing or unoptimized caching!

At Spokeo, we use Spark on Amazon EMR to build data from our data lake into documents that our backend services consume and that are eventually shown in the frontend. As a result, many of our ETL jobs are concerned with producing a single result at the end of a script. Getting there may involve many heavy calculations, e.g. windowing on id, before joining back to the main dataframe. …
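To make the idea in the TL;DR concrete, here is a minimal sketch of "write it out, then read it back" in PySpark. The helper name `materialize`, its parameters, and the choice of Parquet are assumptions for illustration, not necessarily how our jobs are written.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Imported for type hints only; the caller supplies real Spark objects.
    from pyspark.sql import DataFrame, SparkSession


def materialize(spark: SparkSession, df: DataFrame, path: str) -> DataFrame:
    """Write df to path as Parquet and return a DataFrame read back from it.

    The returned DataFrame's plan starts at the files on disk, so later
    actions read those files instead of recomputing the work behind df.
    (Hypothetical helper: the name and the Parquet format are assumptions.)
    """
    # Replace whatever a previous run left at this location.
    df.write.mode("overwrite").parquet(path)
    # Read the freshly written files back as a new DataFrame.
    return spark.read.parquet(path)
```

A job might call something like `windowed = materialize(spark, windowed, tmp_path)` right after the heavy windowing step and before the join, where `windowed` and `tmp_path` are placeholders. The path should be a scratch location that none of the DataFrame's own inputs are read from, since the write overwrites it.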

Spokeo Engineering

Spokeo is a people search engine! We’ve organized over 12 billion records from thousands of sources into easy-to-understand profiles.
