Every Database is Special
How the story begins
Every time I work with a new data team, someone insists that the problems they’re facing are unique. Their challenges are industry-specific, their tech stack isn’t what “most teams” use, or their users need things done a certain way. I’ll nod and listen carefully. There’s always a chance it might be true this time. Usually, once I have a grasp of what they feel the current issue is, I ask them how they got there.
“Our data team started off small, it was just one engineer.”
“We put together a couple of reports with the tools we had.” It’s probably MS SQL Server, or they might be using a DBaaS tool like Snowflake or Databricks.
“Lisa manually triggers the report and puts it in Excel.” Sometimes it might be a dashboard. If Power BI came with their Office subscription, they’ll use that.
Everything seems to work great, so business users notice and start asking for more reports. The sales manager wants a better way to view monthly sales. The product team wants to see inventory trends. The finance team wants to know… everything. The level of complexity in the reporting increases.
There isn’t much formal structure to these transformations. No one really has time for that anyway. The engineer writes a new SQL query for each new dashboard or report as the requests come in. Every once in a while there’s a new piece of business logic, maybe filtering out a certain class of products, that gets included. Copying and pasting the SQL into the new query becomes easier than creating a single source for the product data, so the engineer does that. It would be nice to clean up the duplication, but business users are skeptical of any time spent on anything other than handling their report requests in a timely manner.
Costs begin to increase with all the additional use. Business users start to notice their reports don’t all return the same numbers, and sorting out the differences requires many hours of diligent debugging. Someone notices a few critical reports are using different filter criteria. When asked which one is correct, each business user prefers their method, so the underlying data is left untouched. “We’ll just have to know that it’s different,” they’ll say. This is an early sign that the wheels are falling off the wagon.
After a period of growth, the business finally decides to add headcount to the analytics team to unblock some of the more critical work. But by then, the system is complicated enough that it requires a large chunk of the first engineer’s time just to explain how it all works to the new people. And because of all this hidden complexity, it’s very difficult to update the transformations without unintended side effects. The old transformations are so brittle and complex, new engineers just add additional ones instead. As a result, overall development remains slow and the gains expected by adding more people are never realized.
When this state languishes for long enough, engineers simply give up on improvement. Business users become accustomed to extremely long lead times for changes. Leadership becomes increasingly skeptical of the team’s ability to deliver. Learned helplessness sets in. And attempts to improve the situation typically fail. When asked why things haven’t improved, the question is inevitably met with a shrug and a resigned, “that’s just the way things are.”
What to do next?
Most managers have an underlying feeling there must be a better way. “It can’t be inevitable that data processes always end up as expensive, slow-moving, tangled messes,” they’ll say to themselves.
Some businesses deploy new tech as a solution. Newer is better, right? Business users tend to go along with this; they view it like swapping old computers for faster, better, newer ones. And it’s much easier to sell executives on the idea that their problems are caused by outdated infrastructure. If we just get the new tech up and running, these pesky problems will go away. Also, engineers like getting new tools, so why resist?
These migrations nearly always take longer and cost much more than was expected or promised by the vendor. Most businesses take the seemingly reasonable approach of directly moving their existing code to the new tech stack. They want to make sure that nothing changes. So the engineers directly port their code, changing as little as possible to make the existing code work with the new tech. The new features, which were the primary selling point, are ignored with the thought that they’ll be implemented in a future that may never come.
After a long period of ignoring new development, increased infrastructure spend, and outages caused by the move, the business ends up with the same problems in a new container.
The old tangled mess becomes the new tangled mess.
Total cost of ownership
In the story above, the engineer used whatever tools were available to get the task done as efficiently as possible. This seems like a reasonable approach. And in the moment, it probably was. So what’s wrong with that? A lot, it turns out.
Consider the total cost in time and money of writing, debugging, and maintaining the code for the entire time it’s in production. The initial cost is low. But as the code ages, the cost to debug and extend it increases. Costs are back-loaded. The ease of writing in the first place creates an illusion of a quick and nimble development process, only to see it slow to a crawl as complexity increases.
In terms of technical debt, we’ve bought early on credit and become swamped by interest payments as time goes on.
What kind of debt are we talking about?
There is no avoiding it; the cost of technical debt must be paid. It’s better to figure out how to pay our debts in the beginning, before they overwhelm us. In the context of data systems, debt is accrued in ways that aren’t immediately understood as technical debt. Consider any action that must happen later as debt that must eventually be paid. Many less perceptive managers balk at the idea of spending a week of development to automate a process, but have no trouble with the developer spending a day each month to do it manually. Here are some other examples. Do you see any that look familiar to you?
- Manually triggering any database operation that will be repeated many times in the future.
- Nonexistent development environments that require engineers to create new testing routines each time.
- A lack of documentation, forcing developers and users to constantly ask others for necessary information.
- Switching tools after finding they lack essential features.
- No unit testing or data quality testing, so every error must be investigated manually.
Each of these is an example of technical debt that accumulates over the life of the system. And because these costs are often invisible to end users, everything appears to be working great until, seemingly out of nowhere, it all grinds to a halt. Many data engineers aren’t accustomed to viewing these missing pieces as technical debt and lack the expertise to describe how the situation has unfolded.
The most common reaction upon hearing all this is to say, “Yeah, that’s great in theory, but we have deadlines and obligations. We really need to keep up our velocity.” The person saying this is really hoping that’s the end of the subject. Since there’s a lot of turnover in our industry, you might be able to get away with this for a while before the payments come due. But the payments always come due eventually. You’ll hear excuses like, “Now is a bad time. And tomorrow is, too.” “Who knows what the costs will be?” And, “The future is too hard to predict. Let’s just do what we know will work now.”
Pay with cash, not with credit
Paying debt now is the better long-term strategy. It keeps us from being fooled by rapid development in the early stages. We take a long-term view of every project and carefully consider the total cost of its entire lifecycle. This prevents us from taking shortcuts that end up causing long delays later. When it comes to best practices, considering how they affect the total cost of ownership makes their benefits clear.
Here are five practices I consider essential:
- Put all database transformations in source control.
This can be a painful change to make if the team has a habit of writing stored procedures and testing directly in production. But it’s both possible and necessary. An analytics system of any real size operating outside source control becomes so unwieldy that updates are virtually impossible. Every single change introduces the possibility of breaking an unknown dependency. If you’ve ever played Jenga, you know this feeling.
Learning to use source control like Git requires upfront effort, but the payback is enormous. Having an issue with a specific table? View the history of what was changed and by whom, for everything. Need to know how many references to a specific table exist? A simple ‘grep’ command will tell you. Things that would have been a mammoth undertaking can be done with a single shell command.
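As a concrete illustration (the file and table names are made up for the example), both of those questions come down to a single command:

```sh
# Who changed this transformation, when, and what exactly changed?
git log -p -- transformations/monthly_sales.sql

# Every reference to the table anywhere in the repository.
grep -rn "monthly_sales" transformations/
```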
- Automate deployments through the source control or orchestration system.
After getting everything in source code control, the next step is to automate the deployment of updates. This ensures the production system is always in sync with source code control. Additionally, the orchestration system should be the only tool that touches the production environment. With this rule in place, production is always in a known state and engineers will never have to wonder if something has been manually changed. Orchestration systems provide detailed logging for all actions, eliminating any mystery as to when or how things have changed. Knowing the state of all transformations and all actions taken on the system eliminates entire classes of errors and greatly simplifies debugging.
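Here is a minimal sketch of what that deploy step might look like, assuming a simple setup where the orchestration system runs a shell script on every merge. The script names are hypothetical stand-ins for whatever your transformation framework actually provides:

```sh
#!/usr/bin/env sh
set -eu  # stop on the first error rather than leaving production half-deployed

# Run only by the CI/orchestration system, never by hand.
git pull --ff-only                       # production code always matches source control
./run_transformations.sh --target prod   # hypothetical entry point into the transformation framework
./run_quality_tests.sh --target prod     # fail the deploy if data quality tests fail
```

Because the orchestrator is the only thing allowed to run this, its logs become a complete record of every change that has ever touched production.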
- Create isolated development environments for engineers.
Once everything is in source control and deployments are automated, it becomes easy for developers to work in their own isolated environments. A transformation framework will make this as simple as editing a configuration file, and then running the transformations in the new schema. The isolation is critical. Many data teams use a shared staging version of the production data, but this makes it impossible to know if another developer is working on the same tables, or how recently the data has been synced. Guaranteeing the state is necessary to be sure that any changes will have the desired effect. Some will scoff at the duplication that comes from individual development environments, but the benefits far outweigh any costs. If the datasets are massive, there are ways to limit the data sizes or which transformations are executed. This isn’t an excuse to avoid the necessity of isolated development environments. Proper isolation eliminates entire classes of bugs from sneaking into production.
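To make the isolation concrete, here is a minimal sketch in Postgres/Snowflake-style SQL (schema, table, and column names are invented for the example). The same transformation runs unchanged; only the target schema, set in the framework’s configuration, differs between a developer and production:

```sql
-- Developer sandbox: a production run would build analytics.monthly_sales instead.
CREATE SCHEMA IF NOT EXISTS dev_lisa;

CREATE TABLE dev_lisa.monthly_sales AS
SELECT
    date_trunc('month', order_date) AS sales_month,
    sum(amount)                     AS total_sales
FROM raw.orders
GROUP BY 1;
```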
- Execute data quality tests in development and production.
With the first three practices in place, developers can now implement testing. Without them, testing is unreliable and inconsistent: tests won’t report the same results in development and production, which negates much of their benefit.
If you’re not sure what to test, start by thinking through everything you assume is true about the data, and test those assumptions. It can be as simple as knowing the correct historical value for monthly sales, or that only certain values ever appear in certain columns, and testing that they never change. Most teams write too few tests. If something needs to be true, test it.
Let me repeat that. If something needs to be true, test it.
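Here is a minimal sketch of such a test in plain SQL (table and column names are invented for the example). The convention is that a test passes when the query returns zero rows; any row returned is a failure to investigate:

```sql
-- Assumption: every order has a recognized status and a non-negative amount.
SELECT order_id, status, amount
FROM analytics.orders
WHERE status NOT IN ('open', 'shipped', 'cancelled')
   OR amount < 0;
```

Run the same tests in development and in production; a test that only runs in one place tells you very little about the other.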
- Treat test alerts as failures.
Fix the data issue or fix the test. Never allow a test to keep failing without being addressed. If the issue it flags is already known and accepted, remove the test rather than letting it fail indefinitely. Once a single test fails consistently, it becomes easy to ignore new test failures, and issues will mount. Test thoroughly and keep the tests passing.
Where to next?
Implement these five practices. They are the groundwork for moving from a small proof-of-concept project to a large-scale system that can support dozens of developers and business users. Changes become quick and reliable; they are no longer risky and troublesome. Instead of a host of unknowns, the entire system is visible and navigable through any off-the-shelf text editor. The history of all changes is searchable. And new developers can be onboarded quickly.
By paying your debts upfront, you front-load maintenance, which then allows your system to adapt quickly as requirements, users, and developers change.