Querying the Production Database Directly Considered Harmful
Every data engineer wishes our work could be simpler. Sometimes all it takes is one particularly bad afternoon of debugging for that wish to boil over. You find a duplicate row in the third of seven downstream transforms from one table, throw your hands up in the air, and shout, “Why do we have to do all this? Can’t I just write a query against the production database and be done with it?” And while maybe you could get away with it this one little time, you’re riding your data motorcycle without a helmet, and you will eventually crash.
Data before staging
Back in the early 2000s, AWS was just a gleam in Jeff Bezos’ eye, and the idea of paying a 1,000% markup on database hosting sounded absurd. Back then, the database server lived in a closet and your friendly company IT person would give you credentials (usually their admin credentials) to connect. From there you could fire off all the queries that Dave from accounting wanted, and then go on with your day. Since no one exposed their database to the public internet, and Mission Impossible-style break-ins weren’t very common at Jack’s Midwest Tool Supply, this approach worked just fine.
Back then, directly connecting to the production database was normal. Everyone did it. On a small enough project, you were unlikely to run into any problems. But as the system grew, hidden cracks began to open up. Cracks that could really trip you up.
Finding faults
One problem with the “direct connect” solution is the additional load it can put on your production database. I had a customer early in my career who kindly asked me to only run queries on the database after 5:00pm. He explained that they processed orders at the end of the day and didn’t want any interruptions. I balked. Why should I have to work late? 4:30pm was close enough, right? Databases are resilient. They can handle concurrent queries, I said to myself. I kicked off a query at 4:40pm. My phone started ringing two minutes later. Back then, people answered their phones when they rang without caring who it was. It was my customer. He less kindly requested I stop blocking orders from being processed.
However, database load isn’t the worst problem. The scariest prospect of this kind of direct database management is accidental deletion of production data. I recall a particularly illustrative incident that still keeps me up at night.
A coworker once decided to get creative with some extraction logic. The problem, as he saw it, was that we used a two-step process for replicating data. First we’d pull the data into an intermediary file. Then we’d load that data into the analytics database. Being a seasoned engineer, my coworker declared this to be unnecessary duplication. Instead, he’d operate directly on the production table, bypassing the intermediary file and what he deemed additional effort. His solution worked fine on the first run. What got him was that he forgot about the part of the process that truncated the intermediate table on each restart. On the second run of his solution, all the customer’s invoiced transactions were deleted. In production. I haven’t seen him since.
While I’ll miss the baggy jeans and big brick cell phones of the early 2000s, I’ll never go back to operating without a staging layer to provide a safety net.
Set up a staging layer
The goal of the staging layer is to make a copy of the data from the source system and put it on the analytics database. Then all analytics workloads can be executed on the copy of the data instead of directly on the source database. So how do we do that?
Let’s start with the easiest possible solution to this problem: Copying one table from the source database and putting it in the analytics database. At a high level, it looks like this:
1. Connect to the source database.
2. Select everything from the desired table.
3. Connect to the analytics database.
4. Write that data to the analytics database using some variation of a `create or replace` statement.
A Python script can do this with no more than a dozen lines. And you could even run it on a schedule from any server’s crontab.
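To make that concrete, here’s a minimal sketch of such a script using pandas and SQLAlchemy. The connection URLs, table name, and drivers are placeholders rather than a prescription; swap in whatever your source and analytics databases actually require.

```python
# A minimal sketch of the nightly full-copy job. Connection URLs and the
# table name are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

SOURCE_URL = "postgresql://readonly_user:secret@source-db/prod"        # placeholder
ANALYTICS_URL = "postgresql://etl_user:secret@analytics-db/warehouse"  # placeholder


def copy_table(table_name: str) -> None:
    source = create_engine(SOURCE_URL)
    target = create_engine(ANALYTICS_URL)

    # Steps 1 and 2: connect to the source and select everything.
    df = pd.read_sql_table(table_name, source)

    # Steps 3 and 4: connect to the analytics database and replace the copy.
    # if_exists="replace" drops and recreates the table, the pandas
    # equivalent of a `create or replace`.
    df.to_sql(table_name, target, if_exists="replace", index=False)


if __name__ == "__main__":
    copy_table("orders")  # hypothetical table name
```

Point a nightly crontab entry at a script like that and you have the beginnings of a staging layer.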
It’s a simple solution, but don’t underestimate how far you can go with simple solutions. With small datasets from a single database and daily reporting needs, even a medium-sized business can be well served, as long as there’s a bit of extra monitoring.
A big advantage to such a small amount of logic is consistency. Because you’re copying all the data each night, there’s no extra work to verify that everything arrived, as there would be with an incremental approach. If the columns available in the source change, you’ll get the updated version without needing to manually apply any DDL statements. You also won’t have to worry about missing rows or capturing deleted rows. It’s a snapshot of a moment in time.
Some engineers quickly abandon the full copy approach in an attempt to optimize performance. The thinking is that copy operations are computationally expensive and wasteful, especially when most data hasn’t changed. Capturing only the difference is the way to go, they’ll tell you. In most situations, this is a very poor trade-off. Hardware bottlenecks for small datasets are rarely a problem. Now consider the cost of additional monitoring for a more complex approach. And don’t forget the possibility of missing or duplicated data that just won’t happen with a full copy approach. The minutes saved during nightly data extractions can easily be lost to painful debugging. Full copies are repeatable and reliable.
When simple isn’t enough
So how do we know when we’ve outgrown what our daily full copy process can do? Don’t worry, there’ll be warning signs. You’ll start to hear a chorus of ‘My reports need to be refreshed hourly, daily will never do,’ from any number of departments. Or the accountants will show up and want you to lower your cloud costs or lay off your best engineer. Or you’ll have grown into genuinely big data, hundreds of millions of rows, where the hardware bottleneck has become a real problem.
Before you give up your simple solution in favor of something more complex, use your most shrewd negotiating techniques to push back. If cost is the problem, look for assets that aren’t being used and delete them. Fewer assets mean lower costs. The delete key cuts an unused asset’s cost by 100% and is very easy to code. If latency is the problem, make them explain why. Users always want lower latency, but can rarely justify it. Sales managers will ask for their reports to be updated hourly, even if mid-day invoice totals are more misleading than helpful. They assume that all data is real-time and that real-time is somehow better. Disabuse them of this thinking.
In the end, if you’ve done your best, and the ‘copy once per day’ system you’ve built isn’t cutting it, there are a few solutions that can extend the life of this type of system. An incremental approach may meet the need without having to move to something more robust.
The “easy up front, hard to debug” way
But be warned: incremental updates carry a lot of hidden headaches. With our ‘copy the whole table’ approach, we are guaranteed that the table is in a consistent state. We know that all the data was copied over. But when we switch to an incremental strategy, we now have to manage the state of the source table. And a lot of things can go wrong right here. We move from surefire consistency to ‘I hope nobody deleted any rows.’
With the warning out of the way, here’s the most common method.
Simply find and copy new and changed rows. Look for a datetime field in the table that represents the last time a row was created or updated. Most tables will have something like an `updated_at` or `created_at` field. At the beginning of the copy operation, get the highest value of that field from the target table. Now run a select operation on the source table and filter for records with a datetime greater than the one you just read. The new operation looks like this:
- Connect to the target database.
- Read the highest datetime from the `updated_at` field in the target table.
- Connect to the source database.
- Execute:
  select *
  from source_table
  where updated_at > (select max(updated_at) from target_table);
- Copy the rows to the target table.
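Here is a rough sketch of that incremental pull, using the same assumed pandas-and-SQLAlchemy setup as the full-copy script above. Table and column names are illustrative, and the target is assumed to already hold an initial full copy:

```python
# A sketch of the incremental copy. Connection URLs and names are placeholders.
import pandas as pd
from sqlalchemy import create_engine, text

SOURCE_URL = "postgresql://readonly_user:secret@source-db/prod"        # placeholder
ANALYTICS_URL = "postgresql://etl_user:secret@analytics-db/warehouse"  # placeholder


def copy_new_rows(table_name: str) -> None:
    source = create_engine(SOURCE_URL)
    target = create_engine(ANALYTICS_URL)

    # Read the high-water mark from the target copy. This assumes the target
    # was seeded with a full copy, so max(updated_at) is never null.
    with target.connect() as conn:
        high_water = conn.execute(
            text(f"select max(updated_at) from {table_name}")
        ).scalar()

    # Pull only the rows the source has created or updated since then.
    new_rows = pd.read_sql_query(
        text(f"select * from {table_name} where updated_at > :high_water"),
        source,
        params={"high_water": high_water},
    )

    # Append the new rows to the target copy.
    new_rows.to_sql(table_name, target, if_exists="append", index=False)
```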
When successful, you’ll have all the new rows copied over in a significantly less computationally expensive operation. This system works, but it has some easy-to-spot failure points. Before you get too excited about this strategy, consider these downsides:
Changes to the column structure need to be applied manually
Missing from this strategy is any way to deal with column structure changes. If you’re running commercial software that’s infrequently updated, this probably won’t be an issue. If you’re working on a custom application under active development you’ll need to plan for this. Building in schema checking and updating will be required to avoid these failures.
Updates to existing rows
Unless your tables are append-only, you’ll need a strategy for dealing with data that’s been updated since the last copy event. For that, replace the `select * from source_table where updated_at > (select max(updated_at) from target_table)` append with a `merge` statement. This requires knowing the primary key of the table in order to accurately update the rows. The syntax for merging is significantly more complex than appending. Programmatically generating these statements is recommended, and most programming languages have libraries for this.
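Because the exact `merge` syntax varies by warehouse, a small generator like the sketch below is one way to go. It assumes an ANSI-style `merge` (as supported by Snowflake, BigQuery, and others) and a hypothetical `_staging` table holding the freshly pulled rows:

```python
# A sketch of programmatically generating a merge (upsert) statement.
# The "_staging" suffix and the example names in the call are illustrative.
def build_merge(table: str, primary_key: str, columns: list[str]) -> str:
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in columns if c != primary_key)
    insert_cols = ", ".join(columns)
    insert_vals = ", ".join(f"s.{c}" for c in columns)
    return f"""
        merge into {table} t
        using {table}_staging s
          on t.{primary_key} = s.{primary_key}
        when matched then update set {set_clause}
        when not matched then insert ({insert_cols}) values ({insert_vals});
    """


print(build_merge("invoices", "invoice_id", ["invoice_id", "amount", "updated_at"]))
```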
Deletes to existing rows
Hard deletes make the incremental strategy virtually impossible to implement. Because the query is only scanning for new or updated records, rows deleted from the source won’t be removed from the target table, and the two tables will drift out of sync. It’s better practice for source tables to use soft deletes: a way of marking a row as no longer in use without actually removing it from the table. A simple method for this is to include a column such as `is_deleted`. Even better, use a column called `deleted_at`, populate it with a timestamp, and simply update the row with that information when the row is no longer in use. If the tables you’re syncing include this information, the incremental strategy can still work.
It’s still possible to run an incremental strategy without those columns by comparing the primary keys in the target table against those in the source table. This strategy involves a full scan of both tables, and in practice can take nearly as long as copying the entire table. In my experience, deletes ultimately derail the effectiveness of this strategy and force our hand toward implementing full log-based CDC.
The ‘copy it all’ strategy is a brute force method, but it works when not much force is needed. The incremental methods mentioned above are compromises. They provide an easier implementation at the cost of genuine confidence that the target table matches the source table. Incremental methods rely on software systems consistently managing updated and created dates. Software systems rarely do so reliably. To ensure consistency we need guarantees from the database.
Change Data Capture (CDC) Database Replication
Setting up and maintaining your own replication system is the most complex, but also the most robust and thorough, method of syncing databases. There are plenty of open source options for implementing replication, Debezium being the most popular choice. Third-party paid tools include convenience features, like fully featured UIs, that are often worth the cost. Setting up replication can seem like too much work up front, but the dividends paid later are worth it. Unless something breaks, you can always be certain the data is up to date.
The Change Data Capture method uses the database’s write-ahead log (WAL) to capture changes and broadcast them. These change events are typically sent to a messaging system like Kafka so they can be subscribed to by the target system. The WAL must be configured ahead of time for this method to work. But because every change to the database is logged, it is a reliable source of truth.
In this case, a change event would be any INSERT, UPDATE, or DELETE to a row. When implemented, the target system begins with a full snapshot of the table and then continuously follows the transaction log, allowing the same sequence of changes to be applied to any target system.
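As a rough illustration, a consumer that applies Debezium-style change events from Kafka might look like the sketch below. The broker address, topic name, and the `upsert_row`/`delete_row` helpers are stand-ins for whatever your target system actually needs:

```python
# A sketch of applying Debezium-style change events to a target system.
# Broker, topic, and the apply helpers are hypothetical placeholders.
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",     # placeholder broker
    "group.id": "analytics-replicator",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["prod.public.invoices"])   # placeholder topic


def upsert_row(row: dict) -> None:
    print("upsert", row)   # stand-in for a merge into the target table


def delete_row(row: dict) -> None:
    print("delete", row)   # stand-in for a delete against the target table


def apply_change(event: dict) -> None:
    # Debezium marks each event with an op code:
    # "c" = insert, "u" = update, "d" = delete, "r" = snapshot read.
    op = event["op"]
    if op in ("c", "r", "u"):
        upsert_row(event["after"])
    elif op == "d":
        delete_row(event["before"])


while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error() or msg.value() is None:
        continue
    event = json.loads(msg.value())
    # With the default JSON converter the change data sits under "payload";
    # with schemas disabled the event itself is the payload.
    apply_change(event.get("payload", event))
```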
In this method, the log requires only minor maintenance. Most logging systems will either flush the log after it’s read by the subscriber, or automatically delete logs after a set period of time, such as 30 days. Logs can take up significant amounts of disk space, so tune your system carefully.
Any good CDC implementation will also include the ability to re-run the snapshot on the current data, so even if the data gets out of sync for any reason it’s always an option to get a fresh copy. Some engineers will insist on figuring out exactly how it got out of sync. This is usually not worth the time it takes when re-syncing is built in.
Using APIs
One other method to consider is retrieving your data via an API. This offers less flexibility than a database connection, but if you’re using a third-party SaaS product it might be your only option for accessing your data. There are a few unique challenges that come with this method.
Performance
APIs rarely return data anywhere near the speed of even a modest direct database connection. Small datasets work fine, but once you reach data in the millions of rows, updates can take hours. Incremental updates of data may be the only way to get anywhere near acceptable performance. The slow speed of data transfer compounds into additional problems.
Retries
Network connectivity must be maintained during these long data refreshes. Connectivity issues can come from either the client side or the server side. If the connection drops mid-pull, there’s often no choice but to restart the data pull from the beginning. And since no SaaS solution guarantees 100% uptime, retry handling has to be built into the data system.
Pagination
The design of the API is essentially up to the developer who implemented it. Standards are improving over time, so you may be able to rely on some consistency from system to system, but it’s far from a guarantee. While SQL is frustrating at times, its main strength is consistency across a broad set of vendor implementations. APIs don’t enjoy that level of consistency. While projects like OpenAPI have made progress in this direction there are still plenty of things that need to be done manually.
Most APIs allow you to specify how many records to return for each API request. If there are 2,000 records to return and the API accepts a request of 100 records per page, you can expect 20 API calls to return all the data. The initial API request includes instructions for making the next request, often by returning a new link in the response headers. Any program accepting this data will need to know how the pages are organized in order to effectively make the next API call.
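For example, a pull against an API that advertises the next page in its `Link` response header might look like the sketch below; the endpoint URL and the `per_page` parameter are hypothetical and will differ by provider:

```python
# A sketch of paging through an API that returns the next page as a
# "next" link in its Link response header. Endpoint and parameter names
# are placeholders.
import requests


def fetch_all(url: str, page_size: int = 100) -> list[dict]:
    records = []
    params = {"per_page": page_size}   # parameter name varies by API
    while url:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        records.extend(resp.json())    # assumes the API returns a JSON array
        # requests parses the Link header into resp.links; stop when the
        # API no longer hands us a "next" page.
        url = resp.links.get("next", {}).get("url")
        params = None                  # the next link carries its own query string
    return records


rows = fetch_all("https://api.example.com/v1/invoices")   # placeholder endpoint
```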
Throttling
Let’s say you want to return 2,000,000 records, and you’re running multiple data pulls at the same time. To avoid system degradation and cost, SaaS systems typically impose limits on the number of API calls they’ll accept. If this happens during a data pull, you can expect to receive a 429 Too Many Requests response. When this occurs your program will need to respect the 429 and pause or slow down the rate at which it makes requests to the API. Handling these improperly will cause your data pull to fail and you’ll need to try again. Considering you’ve likely hit API limits, this can be a frustrating game of ‘wait and see.’ Rate limits are often in the fine print of your terms of service and change at the whim of your SaaS provider.
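One way to handle both dropped connections and 429 responses is to wrap every request in a small retry helper like this sketch; the backoff values are arbitrary, so check your provider’s documented limits:

```python
# A sketch of retrying requests on connection errors and 429 responses.
# Backoff values are arbitrary; Retry-After is assumed to be given in seconds.
import time
import requests


def get_with_retries(url: str, max_attempts: int = 5, **kwargs) -> requests.Response:
    for attempt in range(1, max_attempts + 1):
        try:
            resp = requests.get(url, timeout=30, **kwargs)
        except requests.ConnectionError:
            # Network blip on either side: back off and try again.
            time.sleep(2 ** attempt)
            continue
        if resp.status_code == 429:
            # Respect the server's Retry-After header when it sends one.
            time.sleep(int(resp.headers.get("Retry-After", 2 ** attempt)))
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")
```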
Limited Query Options
APIs will typically accept additional arguments as filters for the API request. These usually appear in the form of a ‘since’ parameter or a filter on a unique ID. But don’t expect the kind of granular control you can get with a typical SQL query.
Primary Keys and Nested Data
Since the database schemas won’t be available through an API, it may take some guesswork to determine which fields are unique. Here again we’re at the mercy of the source system and what it can provide. Not only will we need to manually determine primary keys, we’ll also need to unwrap any nested data fields if we plan to push the data to a more traditional SQL database.
The limitations inherent in dealing with APIs make them a poor choice for doing any complex transformations. Instead, write the data to the target system in its nested form, then unwrap it in downstream SQL transformations. This makes updates simpler to maintain. If the transformation logic is executed during the API pull, making changes will require re-running the entire API call. Logic in a simple SQL view layer in the database can be updated at runtime. Modern data lakes all support JSON-style data and can make the downstream unwrapping logic mostly painless. Tables in a SQL database can be as simple as a few columns: one for the unique key, one for the JSON payload, and one or two more for the load datetime or any other relevant metadata.
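Here is a sketch of that landing pattern, with an illustrative `raw_invoices` table and an assumed `id` field serving as the unique key:

```python
# A sketch of landing raw API payloads in a narrow staging table, leaving
# the unwrapping to downstream SQL. Table, column, and key names are
# illustrative placeholders.
import json
from datetime import datetime, timezone
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://etl_user:secret@analytics-db/warehouse")  # placeholder


def land_payloads(records: list[dict]) -> None:
    loaded_at = datetime.now(timezone.utc)
    rows = [
        {
            "id": record["id"],             # assumed unique key in the payload
            "payload": json.dumps(record),  # keep the nested JSON intact
            "loaded_at": loaded_at,
        }
        for record in records
    ]
    if not rows:
        return
    with engine.begin() as conn:
        conn.execute(
            text(
                "insert into raw_invoices (id, payload, loaded_at) "
                "values (:id, :payload, :loaded_at)"
            ),
            rows,
        )
```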
The End Result
Doing this allows the analytics workloads to safely run on an exact copy of the data from the source system. There will no longer be any need to worry about locking up tables in the production system or accidental deletion of data.
If you build a robust system that handles all the issues brought up above, you can be certain the upkeep will be relatively painless. If instead you build in lots of stateful incremental processes, have APIs with lots of unhandled exceptions, and have poor monitoring systems, you can expect long days at the office.
The staging layer provides a level of safety that is required for any serious organization. If you’re finding it difficult to maintain, you’re probably suffering from brittle incremental processes or unhandled API issues. Spend the time and energy to implement log-based CDC so you can let the database do its job. Build robust handling into all your API requests and keep the business logic out of them. Stick to these standards and you’ll spend much less of your time babysitting data pipelines with your fingers crossed.
And even in very poor implementations, it’s a lot better than deleting all your customer’s invoices.