Chasing down duplicate rows in analytics systems may very well be the primary reason my beard is grey. I’ve lost a few workdays to very hard-to-find duplication issues. Since any column ‘can’ be the primary key, and most analytics systems don’t enforce them at the database level, it’s virtually assured that at some point joins will lead to row duplication. Any row duplication will make the resulting tables and everything downstream unreliable. It’s the worst way to start the work day.

It’s not inevitable that this problem will happen. For the first few years in my career, I worked for a BI product company where row duplication rarely happened. Here’s how we did it:

All tables had a primary key called ‘id’. All joins were on the primary key. No exceptions.

At the cost of being slightly less convenient than not having this rule, the benefits were immense:

  1. Join conditions are all known in advance. Row duplication is nearly always the case of an incorrect join condition. When any column, or set of columns, can be the unique key mistakes are much easier to make. And much harder to diagnose.

  2. Debugging joins becomes trivial. Most of the SQL reading effort can be spent parsing the select criteria, where most updates need to be made. The join conditions simply need to follow the ‘id’ standard.

Some workarounds are necessary, but fewer than you might think.

What about event tables where there is no obvious primary key? For these, autogenerating a key works just fine. The benefits of this approach are most important for tables you expect to join to. For fact tables that are final products, autoincrementing works fine.

What about tables that already have multi-column primary keys? These are the simplest, as combining the two fields with a predetermined delimiter should be all that’s needed. For example:

select
  CONCAT(bill_to, '-', shipto) as id,
  bill_to,
  ship_to,
  ud_field,
  other
from
  ship_to_table

The benefits of making join conditions more straightforward is hard to overstate. Most analytics systems need more rules and structure to be widely useful, and settling on these conventions is a good start.