Git For The Reluctant Data Engineer

The modern data engineering landscape has no shortage of ways to shoot yourself in the foot. One little change and the whole system can come crashing down. When you’ve seen one outage too many, it’s time to stop relying on fifteen open tabs in your DB tool and get organized with a well-maintained Git repository.

In modern software development it’s nearly impossible to avoid using Git. Git provides a line-by-line history of every change made to a codebase, supports code review via pull requests on hosting platforms, and makes it easy to create copies of the codebase for local or staging environments. These features are so common and necessary that the first task for any new software project is setting up the Git repository.

But it seems that data engineers think they are exempt. Working directly on the database and running ad-hoc queries is standard operating procedure. Instead of code being repeatedly executed like you find in a more traditional software team, data engineers are often executing one-off ALTER or INSERT statements. Once a query is run, it’s never expected to be run again. If you ask a data engineer, they’ll tell you, “These types of operations don’t really work in the Git model. Besides, it was a one-off. We don’t need to track that.” Source control systems are either ignored or implemented poorly as an afterthought.

Another barrier to using Git is that it’s perceived as difficult to work with. It’s true that some of the conventions are unintuitive and can take time to learn and understand. And when you’re in a hurry to ship the latest update, these can seem like unnecessary roadblocks. But the benefits are worth the modest effort of the learning curve.

Source control use among data teams tends to settle into one of these five levels:

Level 1: We don’t use Git. Everything is running fine and it’s unnecessary baggage.

Level 2: We have a repo that our engineers use to store some of our procedures in.

Level 3: Most of our code is in a repo, but I’m not sure how up to date it is.

Level 4: We use a framework that requires we check in everything and deploy from there. We still do some manual stuff when we need to.

Level 5: We use a framework for transformations that deploys directly from Git. Even manual changes go through our production tools and are checked in.

The closer a team gets to level 5, the easier it is to develop, maintain, and explore the data system. If your team is on level 1, you probably spend most of your time putting out fires. Even if production is relatively smooth, it’s a near certainty there are bugs hidden in your data. Database operations are probably executed on an engineer’s local machine with no code review.

Even the most disciplined team member is at risk of accidentally missing a filter criterion, copying and pasting incorrectly, or insufficiently testing a change before running it in production. These bugs aren’t usually discovered until they are reported by unhappy business users. Without source control, it can be nearly impossible to know exactly what caused the issue or be certain that it won’t happen again.

Teams in this state are generally guessing at a fix and hoping that the data outputs are good. In fact, a team refusing to use a source control system is the most reliable indication that the data operations are of poor quality.

Stop and think about that for a moment. The single biggest indicator of data quality is “do they use Git consistently?” It’s not how many design docs a team has or how detailed they are. It’s not the size of the team, or how experienced they are. And it’s definitely not the cost to build the data system. It’s a question of whether the team is using source control consistently.

Getting started

If you’re on level 1, don’t despair. Getting started doesn’t take much work. You just need a new repo and knowledge of fewer than ten commands.

Start by picking a folder to put all your SQL, docs, and code in and initialize Git on that folder. Use the command:

git init

This creates a .git directory to contain Git’s metadata.
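As a minimal sketch, initializing a repo might look like the following. The folder name data-platform and the scratch location are assumptions for illustration; in practice you would run git init inside your existing project folder.

```shell
# Work in a throwaway scratch directory so the example is safe to run anywhere.
workdir="$(mktemp -d)"
mkdir "$workdir/data-platform"   # hypothetical project folder
cd "$workdir/data-platform"

# Create the repository; this adds the hidden .git directory.
git init

# Confirm Git now recognizes the folder as a repository.
git rev-parse --is-inside-work-tree
```

The .git directory is hidden by default, so use ls -a if you want to see it.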

With the repo initialized, let’s start keeping a history of all the changes made to the database. Doing this will require two commands. The first command is:

git add

This command tells Git which files you intend to save in the next command. It exists because you may have made changes in some files that you don’t want to save yet. If you intend to include all the files, you can use * as your final argument. Note that your shell expands * and skips hidden files; git add . stages everything in the current folder, hidden files included.

git add my_file.sql

git add *

Once you’ve added all the relevant files, run the command:

git commit

This is essentially the ‘Save’ command. Commits require a message attached so the full syntax is:

git commit -m 'My commit message'

Many engineers find the message requirement a nuisance, until they have to search through the history to find a specific change event. Spend a moment and consider what message will tell future team members, or even your future self, what this change did.

Each time you update your files, run those two commands.

git add *

git commit -m 'message'

Make it a habit to run these often. The majority of your Git operations are going to be these two commands.
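Here is a rough sketch of one pass through that loop in a scratch repo. The file name, the SQL statement, and the identity values are hypothetical placeholders, and git status and git log are included only to show what Git sees before and after.

```shell
# Scratch repository so the example does not touch real work.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Placeholder identity for this scratch repo only (hypothetical values).
git config user.name "Example Engineer"
git config user.email "engineer@example.com"

# A hypothetical one-off statement that would otherwise live in an open tab.
echo "ALTER TABLE orders ADD COLUMN region TEXT;" > add_region_column.sql

git status --short                 # lists the file as untracked
git add add_region_column.sql      # stage it for the next commit
git commit -q -m 'Add region column to orders'

git log --oneline                  # one line per commit, newest first
```

Even a one-off ALTER statement saved this way leaves a dated record of what ran and why.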

Once you’re comfortable using these two commands, consider adding a GUI-based tool to your workflow. While everything in Git can be done from the command line, basic operations like adding, committing, and browsing the commit history are easier in a GUI tool. There are dozens of free options, and most of them share the same basic visual style and feature set. Pick one you like and get comfortable with the basic commands. These are some of my favorites:

Sourcetree (https://www.sourcetreeapp.com) - Free, works on Windows and Mac, and provides a simple interface for all the basic commands.

Sublime Merge (https://www.sublimemerge.com) - A companion program from the makers of Sublime Text that runs on Windows, Mac, and Linux. It will request paid registration, but registration is not required for continued use.

GitHub Desktop (https://desktop.github.com/download) - GitHub’s free companion tool for Git operations.

GUI tools are great for visualizing the history of changes. The common convention in these tools is to display the history on the left with a column to the right indicating the differences (‘diffs’) line by line between each commit.

Diffs are extremely valuable when debugging issues in production. Check for when the error occurred and look for commits that occurred around the same time. The more consistent you are with using Git, the more accurate this method of debugging becomes. With proper controls around production it may be the only debugging step you need.
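The same investigation can be done from the command line. The sketch below builds a tiny scratch history in which a hypothetical report query loses its filter, then uses git log, git show, and git diff to find the change. The file name, query, and two-day window are assumptions for illustration.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Engineer"        # placeholder identity
git config user.email "engineer@example.com"

# First commit: the query with its filter in place.
cat > daily_report.sql <<'SQL'
SELECT id, total
FROM orders
WHERE status = 'complete';
SQL
git add daily_report.sql
git commit -q -m 'Add daily report query'

# Second commit: the filter is accidentally dropped.
cat > daily_report.sql <<'SQL'
SELECT id, total
FROM orders;
SQL
git add daily_report.sql
git commit -q -m 'Simplify daily report query'

# List commits made in the window around the incident.
git log --since='2 days ago' --oneline

# Show what the most recent commit changed, line by line.
git show HEAD

# Compare the two most recent commits for a single file.
git diff HEAD~1 HEAD -- daily_report.sql
```

In the diff output, removed lines start with a minus sign and added lines with a plus sign, so the missing WHERE clause stands out immediately.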

I’m just one person. What about teams?

If you’re a solo developer, following these steps and backing up your local machine regularly might be enough to get most of the benefits of Git. But since we’re talking about data teams, let’s assume multiple members are contributing to the codebase. The first thing you’ll need is a remote repository.

This is where tools like GitHub (https://github.com), GitLab (https://about.gitlab.com), Bitbucket (https://bitbucket.org) or others come into the picture. They offer remotely hosted repositories with proper backup and security. Most of them offer a free tier you can start with, so the barrier to entry is low. Each vendor has a set of instructions for setting up SSH authentication, which allows you to interact with the repo without manually entering your username and password every time. You typically only need to do this once per machine.

In our first example, we initialized a repo locally. When working with a remote repository, I recommend creating the repo in the remote source, then cloning it locally. Use the command:

git clone remote_location:my_remote_repository.git

This copies the code to your local machine from the remote source location. The local Git repository needs to be configured to point to that remote repository, and cloning it from the remote repository does that for you without needing to manually configure the remote location.

The distributed nature of the remote repository requires us to learn two more commands:

git pull

git push

The pull command is essentially a sync command. It takes any changes that have occurred on the remote server and applies them to your local repository. The push command does the reverse. It takes your local commits and applies them to the remote repository.
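To try clone, push, and pull without signing up for anything, the sketch below uses a bare repository on the local filesystem as a stand-in for a hosted remote. The engineer names, file names, and paths are hypothetical; with a real vendor the clone argument would be the URL they give you.

```shell
base="$(mktemp -d)"

# Stand-in for a hosted remote: a bare repository on the local filesystem.
git init -q --bare "$base/remote.git"
# Make the remote's default branch 'main' regardless of local Git settings.
git -C "$base/remote.git" symbolic-ref HEAD refs/heads/main

# First engineer clones the (still empty) remote and makes a commit.
git clone -q "$base/remote.git" "$base/alice"
cd "$base/alice"
git config user.name "Alice Example"           # placeholder identity
git config user.email "alice@example.com"
echo "CREATE TABLE customers (id INT);" > customers.sql
git add customers.sql
git commit -q -m 'Add customers table'
git branch -M main                             # name the branch 'main'
git push -q -u origin main                     # first push sets the upstream

# Second engineer clones and receives that commit.
git clone -q "$base/remote.git" "$base/bob"

# Alice commits and pushes another change.
echo "CREATE TABLE orders (id INT);" > orders.sql
git add orders.sql
git commit -q -m 'Add orders table'
git push -q

# Bob syncs his copy with the remote.
cd "$base/bob"
git pull -q
ls                                             # both SQL files are present
```

The -u flag on the first push records which remote branch the local branch tracks, which is why the later git push and git pull need no arguments.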

Dealing with branches

Pulling and pushing gets more complicated when there are multiple people committing to the same repository. When that happens, you need to learn about branching.

For our sample project we’ve done all of our commits on the main branch. If many people are committing directly to the main branch, there is the possibility of conflicts. Also, an engineer might not want to push their changes to the main branch until all the work is finished, but they still want their work saved somewhere other than locally. They need an isolated way to do their work without it affecting ‘main’. To accomplish this, use a branch.

A branch is essentially a copy of the code at the point the branch is created. You can add and commit to it just like you would for the main branch. Once you push the branch to the remote repository, others can pull it and work with it as needed. Switching between branches takes just one command, so moving between main and an in-progress feature branch makes testing out new code simple. Keeping in-progress work in a branch also means there is no need to revert code in main if the feature is abandoned. Just delete the branch.

We can create and switch to a branch with the command:

git checkout -b 'other_branch'

The checkout command will switch to another branch. If you are creating a new branch, include the -b argument.

If the main branch receives updates that you’d like to include in your branch, run this command to pull those changes into your branch:

git merge main
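Put together, a branch workflow might look like the sketch below, run in a scratch repo. The branch name add_revenue_view and the SQL files are hypothetical, and the --no-edit flag simply accepts Git's default merge message so that no editor opens.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Engineer"        # placeholder identity
git config user.email "engineer@example.com"

# Starting point on main.
echo "SELECT 1;" > base.sql
git add base.sql
git commit -q -m 'Initial commit'
git branch -M main

# Create and switch to a feature branch, then commit work there.
git checkout -q -b add_revenue_view
echo "CREATE OR REPLACE VIEW revenue AS SELECT 1 AS amount;" > revenue_view.sql
git add revenue_view.sql
git commit -q -m 'Add revenue view'

# Meanwhile, main receives an unrelated update.
git checkout -q main
echo "SELECT 2;" > hotfix.sql
git add hotfix.sql
git commit -q -m 'Add hotfix query'

# Back on the feature branch, bring in the latest main.
git checkout -q add_revenue_view
git merge -q --no-edit main

git branch      # the asterisk marks the current branch
ls              # files from both branches are now present
```

If the feature were abandoned instead, switching back to main and running git branch -D add_revenue_view would discard it without touching main.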

What about pull requests?

In Git development, pull requests (PRs) serve a key purpose: to protect the main branch and ensure code is reviewed before it’s integrated. They’re common, but optional. When a team is working on a codebase, they usually don’t want developers merging into the main branch without some type of approval, so they set some merge protection rules on the main branch.

A common protection rule is to require approval from multiple team members before merging to main. An engineer starts an update on a new local branch. They then push this branch to the remote repository and create a pull request (PR) in the remote’s user interface. This PR serves as a formal request for other team members to review and approve the code. Once the necessary approvals are given, the code can be merged into the main branch on the remote repository. The PR itself is maintained in the system and can be used as documentation.
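The Git side of that flow is small; the pull request itself is created in the hosting vendor's interface, not by a Git command. As a sketch, the following publishes a hypothetical branch named fix_null_totals to a local bare repository standing in for the remote, then previews what reviewers would see. All names and the SQL are assumptions for illustration.

```shell
base="$(mktemp -d)"

# Local stand-in for the hosted remote, with 'main' as its default branch.
git init -q --bare "$base/remote.git"
git -C "$base/remote.git" symbolic-ref HEAD refs/heads/main

git clone -q "$base/remote.git" "$base/work"
cd "$base/work"
git config user.name "Example Engineer"        # placeholder identity
git config user.email "engineer@example.com"

# Seed main so there is something to branch from.
echo "SELECT 1;" > base.sql
git add base.sql
git commit -q -m 'Initial commit'
git branch -M main
git push -q -u origin main

# Start the update on a new local branch.
git checkout -q -b fix_null_totals
echo "UPDATE orders SET total = 0 WHERE total IS NULL;" > fix_null_totals.sql
git add fix_null_totals.sql
git commit -q -m 'Backfill NULL order totals with 0'

# Publish the branch so a pull request can be opened against it.
git push -q -u origin fix_null_totals

# Preview the changes the branch adds relative to main.
git diff --stat main...fix_null_totals
git branch -r                                  # remote branches now include the new one
```

The three-dot form of git diff compares the branch against the point where it split from main, which corresponds to the changes the branch introduces.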

My transformations all exist on the database. How do I use Git for that?

If you’re lucky enough to be creating a brand new data system from scratch, use a framework like dbt (https://www.getdbt.com) where Git is built into the development process. For established or more complex systems it’s still possible to improve your processes by using a source control solution.

A good first step is putting your views and stored procedures into Git. Whenever something needs to be updated, put it into a text file and enclose the content in your DB’s version of a CREATE OR REPLACE syntax. It’s minimally disruptive to your workflow and immediately provides some benefits. At the very least you’ll have a copy of the view or procedure in your repo in case something needs to be rolled back later. It can also be used as a comparison later if you need to debug something.

Are you worried someone made changes without telling you? Just pull the current version off the database and compare it to your version in Git. Make a habit of adding new views and procedures every time you interact with them. If your team has bad or nonexistent processes around source control, even these small improvements will help.
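A sketch of that comparison is below. How you export a definition from the database depends on your vendor, so the export is simulated here by overwriting the tracked file by hand. The view name, columns, and folder layout are hypothetical, and some databases use a different form than CREATE OR REPLACE.

```shell
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Engineer"        # placeholder identity
git config user.email "engineer@example.com"

# The version of the view that the team reviewed and committed.
mkdir views
cat > views/active_customers.sql <<'SQL'
CREATE OR REPLACE VIEW active_customers AS
SELECT id, name
FROM customers
WHERE status = 'active';
SQL
git add views/active_customers.sql
git commit -q -m 'Add active_customers view'

# Simulated export of the live definition from the database,
# written over the tracked file. Someone changed the filter.
cat > views/active_customers.sql <<'SQL'
CREATE OR REPLACE VIEW active_customers AS
SELECT id, name
FROM customers
WHERE status IN ('active', 'trial');
SQL

git status --short                             # the file shows as modified
git diff -- views/active_customers.sql         # the drift, line by line
```

From here you can either commit the exported version to record what production actually runs, or use git checkout -- views/active_customers.sql to restore the committed version and redeploy it.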

My team is reluctant to use Git. What can I do?

Fortunately, one stubborn engineer can do a lot to move things forward. If the team doesn’t have a repo, start your own. There’s no need to be pushy about it, just consistently put all your code there. When team members ask questions, send them direct links to your code on the remote repository. Most vendor implementations make it trivially easy to share links to individual lines of code. When someone is trying to debug something and wants to know what could have happened, point them to the Git history you’ve maintained. When discussing a change you’re working on, make a pull request with the diff and show it to the other team members. Most engineers will go along with it, and might even adopt some of your ideas if you’re persuasive enough. Say things like “I can’t remember everything that’s changed unless I keep it all in the repo,” or “The diff makes for a nice view of everything I’m updating and ensures I don’t make any accidental changes. You can never be too careful.” The soft sell tends to get more traction than an edict.

A well-maintained repository is better received by management, tends to settle more arguments, and generally makes life easier. When engineers are debating who is responsible for an issue, the one who can present a complete list of changes with timestamps will be taken far more seriously than someone relying on a hunch. And when leadership wants to know who gets work done, it’s much easier to point to a list of pull requests and merges than a vague sense of being busy. Even if you’re not convinced of the practical necessity of Git, the aura of professionalism that comes from working this way carries a lot of weight. It demonstrates that you care about the work.

While there are entire books written on Git (and I’d recommend “Pro Git” by Scott Chacon and Ben Straub for further reading https://git-scm.com/book/en/v2), the vast majority of day-to-day use comes down to mastering just a handful of commands. Considering that consistent Git use is the most direct way of evaluating the quality of a data system, you won’t find a better ROI than learning these few simple commands.