Total Cost of Data Ownership

My personal finances started to look much better when I internalized total cost of ownership. Total cost of ownership (TCO) provides a more complete picture of any purchase. A great example is buying a car. Folks selling cars will point to a low initial price or a low monthly payment as a reason to purchase. Savvy buyers will count interest, maintenance, gas, insurance, time spent driving, etc., toward the actual cost of the car. As an aside, that’s why I purchased an electric Kia Niro a few years ago. The somewhat higher upfront cost, combined with much lower maintenance and fuel costs, amounted to a low TCO.

TCO is an important consideration for data projects as well. Decisions about what to add to a project, or what to leave out of it, often fail a TCO analysis: the upfront costs are low, but the long-term effects grind development to a halt. Common examples:

Little or no documentation about a project’s state. If we aim for a minimal TCO, we’ll want to strike the right balance between ‘no docs’ and ‘document everything all the time.’ Most data projects don’t treat documentation as part of the project and set no requirements for it, so they swing between documenting nothing and pausing all work to write voluminous docs that go out of date almost immediately. Both strategies carry a high TCO. A better approach is to consider the initial documentation needs and produce only what the users and developers intend to rely on. Typically, a concise, broad overview of the intent of data sets provides enough information for end users.
Data pipelines have minimal or incomplete CI/CD. This is an especially bad trade-off, since a team will spend much more time iterating on and maintaining a data pipeline than it spent on the initial build. Without CI/CD, most developer time goes to figuring out how to make changes safely.
No method of removing deprecated pipelines. Almost every company struggles with this problem. Adding new data sets is straightforward, but removing one requires multiple sign-offs and an endless search for usage. If getting rid of code is hard but writing new code is easy, it shouldn’t be a surprise that data projects continually increase in size. Some teams attempt to combat this by setting a high bar for getting anything into the code base, but this only slows the growth. Making code cleanup and deprecation easier is a more effective long-term approach. Even a simple policy of removing data sets that fall below a certain usage threshold can mitigate this problem.
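To make the usage-based cleanup idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `DatasetUsage` record, the `reads_last_90_days` metric (which might come from query logs or a data catalog), and the threshold of five reads are hypothetical, not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class DatasetUsage:
    name: str
    reads_last_90_days: int  # hypothetical metric, e.g. derived from query logs


def deprecation_candidates(usage, min_reads=5):
    """Return the names of data sets whose usage falls below the threshold."""
    return [u.name for u in usage if u.reads_last_90_days < min_reads]


# Illustrative, made-up usage numbers.
usage = [
    DatasetUsage("orders_daily", 1200),
    DatasetUsage("legacy_export_v1", 0),
    DatasetUsage("marketing_experiment_2019", 3),
]

print(deprecation_candidates(usage, min_reads=5))
# ['legacy_export_v1', 'marketing_experiment_2019']
```

The list this produces is a starting point for a deprecation review, not an automatic delete. The value is that the search for usage becomes a routine query instead of an investigation.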

When asked to estimate the time to create a new feature, we should also consider the time spent on maintenance and everything else in TCO. We can’t pretend those are free, just like we can’t pretend our monthly payment is a good representation of the cost of that new truck.
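As a rough illustration of that kind of estimate, here is a back-of-the-envelope sketch in Python. The function name, the cost categories, and the numbers are all hypothetical; a real estimate would use whatever categories matter for your team.

```python
def estimate_tco_hours(build_hours, monthly_maintenance_hours,
                       expected_lifetime_months, removal_hours=0):
    """Rough total cost of ownership for a feature, in developer hours.

    Assumes maintenance cost is flat per month, which is a simplification.
    """
    maintenance = monthly_maintenance_hours * expected_lifetime_months
    return build_hours + maintenance + removal_hours


# Made-up numbers: a one-week build, half a day of upkeep per month,
# a three-year lifetime, and two days to retire the feature cleanly.
total = estimate_tco_hours(
    build_hours=40,
    monthly_maintenance_hours=4,
    expected_lifetime_months=36,
    removal_hours=16,
)
print(total)  # 200
```

With these numbers, the 40-hour build is only a fifth of the total, which is the data-project version of the monthly payment on the truck.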