There is no end to the discussion about performance in big data pipelines. If you ever ask a user what they want, there are two things you are guaranteed to hear:

  1. More data (more reports, more history, more breakdowns)
  2. Lower latency

While the rest of the world is guaranteed only death and taxes, we are guaranteed these two. Unfortunately adding more data typically increases run time and then increases latency, so these two requests are directly at odds. You could potentially get both if you’re willing to double or triple your costs, but that typically isn’t an option.

When it comes to improving performance to accomplish both goals, we have a limited number of tools in our belt. They typically end up being slotted into one of two buckets:

  1. Move less data
  2. Use more threads

Option number one can be implemented in quite a few ways. First and foremost, it is okay to delete reports/dashboards in the pipeline that are no longer in use. There will almost certainly be some push back, but if you implement a policy that pipelines will be archived when they receive little or no usage, and you hold your ground, it can work. If the company pushes back on this, show them how much it affects the bill (assuming they’re using a cloud service).

Also in the ‘move less data’ option is implementing incremental updates to large tables. This always carries some risk of inaccuracies. Even the most popular commercial applications tend to be inconsistent on tracking updates in a reliable fashion (never ever believe a ‘last_updated’ field to be accurate). When data was more expensive and less ubituitous a common strategy was to incrementally refresh during the week and do full refreshes on weekends. This is less common now but a perfectly valid implementation to maximize latency and spend.

Another ‘move less data’ option is limit the length of history in your reports. I’m constantly surprised at the insistence of using data more than three years old in reporting. There may be some value, but the most recent year’s data has high value, and that six year old data isn’t likely to give much insight.

My favorite tool in the tool box is using more threads. Data applications are somewhat uniquely suited to high levels of parallelism. I was working on a data pipeline recently that had about 2000 unit tests and the lead on the project was debating an appropriate number of threads. I suggested that we get as close to the theoretical max of one thread per test as possible. While I would have loved to use 2000 threads, the application itself bogged down when creating more than 64, which was still roughly a 4x improvement over what hey had before.

In contrast, if you allow unrestrained growth on the data pipeline, all the history, never delete anything, and refuse to add more threads, you can be certain to have a slow and expensive system.