Each Table Has Its Place
Back in college, I was the guy to go to for all your computer issues. I could play all the hits like “You need the correct printer driver installed,” “Have you considered an OS from this decade?” and my personal favorite, “Have you tried turning it off and back on again?”
Most of these issues were easy to solve, so I didn’t mind being the hero. But there was one thing I stumbled on that still haunts me. I went to help a colleague connect to the Wi-Fi, and when I closed the browser, I discovered a digital disaster. His desktop was a chaotic mess of hundreds of files, stacked so deep they completely obscured the wallpaper.
I asked him if he knew about folders and if he had considered using them.
“Also, maybe don’t call them Untitled1 followed by Untitled2,” I added.
“Why not? This is much easier,” he argued.
He spent ages dragging icons around like it was a game of desktop Tetris, so I doubted his claim that it was “easier.” I don’t think he ever discovered sorting or the Finder window, and in the end, he kept his chaos exactly as it was.
But data professionals know better, right?
If your data project is built one table at a time with no advance planning, it’s almost inevitable it will resemble the structure of my poor friend’s computer. With no hierarchies, folders, or naming conventions, it becomes one big schema with a bunch of tables and views. Projects like these end up with tables named v_sales_orders_detail_v2_fin followed immediately by v_sales_order_detail_no_sp_2_fin.
When you’ve only ever worked this way, organizing will be extra effort. But a clean setup from the start pays off every time you touch it. Good structure lowers cognitive overhead, speeds development, and cuts down on mistakes. And when you bring someone new on board, you won’t be stuck explaining odd names with a shrug.
Starting with table names
The first thing to establish is the naming convention for the project. Architects like to start with big ideas and diagrams, then work down to the details. Builders, on the other hand, start with the foundation and work up from there. I prefer to be a builder. A good naming convention provides a solid foundation for your project. It’s easier to agree on, provides a consistent description of what tables do, and negates the need for big diagrams. If your team can’t even settle on what to call tables, it probably means you shouldn’t be working on higher-level abstractions yet anyway.
Let’s take my bad example from above and see how we might improve it.
v_sales_order_detail_no_sp_2_fin
Do not include metadata about the table in the name
Some teams like to indicate whether an object is a table or a view, most commonly by adding a v_ or t_ prefix somewhere in the name. The rationale is that this tells end users how fresh the data is. But the assumption that views provide real-time data is often mistaken. Views are frequently downstream of tables that refresh on an hourly or daily cycle, making their data just as delayed as a table's.
This practice also makes changing the materialization type more difficult. Perhaps the view has become large enough that queries are unacceptably slow, and it should be converted to a table. Would the name be changed to reflect that? Would changing it break downstream references? Whether an object is a table or a view can easily be answered in your database exploration tool. Including a label for it in the name only adds noise.
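As a sketch (the table and column names here are hypothetical), swapping a slow view for a table of the same name is a small, non-breaking change precisely because the name carries no materialization label:

```sql
-- Hypothetical example: the view has grown too slow, so we materialize it.
-- Because the name doesn't encode "view", downstream queries keep working.
DROP VIEW sales_order_detail;

CREATE TABLE sales_order_detail AS
SELECT order_id, product_id, quantity, order_amount
FROM stage_sales_orders;   -- assumed upstream staging source
```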
Upon removal of the label, we get:
sales_order_detail_no_sp_2_fin
Do not include version numbers or finish state in the name
If your table-naming strategy is “original name plus 2,” it’s time to stop and rethink this approach. Most requests that look like duplication can be solved by applying the right filters. Requests like “Can you give me that same report but include special orders?” can be handled directly in nearly any BI tool. Keep your tables general and do your filtering downstream.
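For instance, rather than cloning the table into a second no-special-orders variant, a downstream report can filter the general table. This is only a sketch; the is_special_order flag is a hypothetical column:

```sql
-- One general table, filtered per request instead of duplicated:
SELECT order_id, order_date, order_amount
FROM sales_order_detail
WHERE is_special_order = FALSE;   -- flip the filter for the "include special orders" report
```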
The use of fin or temp should also be eliminated for similar reasons. If it’s in production, we should assume it’s finished. The best way to indicate a table is temporary is by deleting it, not by claiming it’s a temp table. Temp tables have a bad habit of living much longer than engineering teams expect. No one wants to be the one explaining why that temporary table is still there five years later. By the same token, labeling a table as fin will inevitably cause confusion about what distinguishes it from all the other tables that aren’t marked fin.
Remove the fin from our table name and now we have:
sales_order_detail_no_sp
Avoid unclear abbreviations like no_sp
If an abbreviation is absolutely clear to everyone in the organization, then use it. But for anything even remotely unclear, use the whole word. Long table names are much less of a problem than confusing ones.
After the abbreviations are removed, our table name looks like this:
sales_order_detail
Avoid vague table typing
Before cloud data warehouses and columnar storage became the standard, it was common practice to precalculate aggregations in a table labeled agg and differentiate it from the detail table. At the time, this made sense. But in a modern data system, we can assume a table is detail unless otherwise specified. If your datasets are exceptionally large, it may be necessary to pre-aggregate them into a downstream table. If so, suffix the table name with its aggregation grain, like daily or monthly, rather than the generic agg.
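When pre-aggregation really is needed, naming the table for its grain might look like this (table and column names are assumed for illustration):

```sql
-- Hypothetical daily rollup, named for its grain rather than a generic "agg":
CREATE TABLE fact_sales_orders_daily AS
SELECT order_date,
       SUM(order_amount) AS total_order_amount,
       COUNT(*)          AS order_count
FROM fact_sales_orders
GROUP BY order_date;
```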
Specific use guidance in your table names can provide value and should be included when appropriate. For example, in an analytics system these table types are commonly included in the naming convention:
fact - Fact tables describe events that have occurred. Common examples of these would be sales invoice data, general ledger transactions, or click events in marketing data. These are typically the largest tables in the system.
dim - Dim (dimension) tables include data about things. An example of this would be a product table, with each row containing information about each product in inventory.
Some less common table types include:
stage - A staging table applies simple transformations to source data before it reaches a fact or dimension table. When common business logic needs to be applied across multiple tables, place it in an upstream staging table. Staging tables are rarely queried in visualization tools. Their purpose is to centralize transformations and reduce redundant logic, and they are often implemented as views.
int - Use the int label when you need more complex logic than a staging table allows. Intermediate tables perform calculations and aggregations that downstream tables depend on. The main difference between intermediate and staging tables is complexity; intermediate tables are also usually built as physical tables for performance reasons.
map - Use a mapping table when you need to associate a short code with a longer description. For example, state abbreviations can be mapped to full state names. These tables are maintained by the data team, not the source system. They keep fact and dimension tables efficient by storing codes instead of long strings.
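A minimal sketch of such a mapping table (the column names, and the ship_state_code join key, are illustrative):

```sql
-- Maintained by the data team, not loaded from a source system:
CREATE TABLE map_state (
    state_code CHAR(2) PRIMARY KEY,   -- e.g. 'CA'
    state_name VARCHAR(50) NOT NULL   -- e.g. 'California'
);

-- Fact tables store the short code and join out for the long name:
SELECT f.order_id, m.state_name
FROM fact_sales_orders AS f
JOIN map_state         AS m ON m.state_code = f.ship_state_code;
```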
Prepending the table type to a table name groups tables of the same type together in sorted listings, which makes browsing for data easier. Most business users know what type of data they are looking for, but not necessarily the name.
Using fact is appropriate for our orders table, and since detail is the implied default, we can drop it as well. That gives us the name:
fact_sales_orders
If you keep all tables in one schema, use a naming format like {department}_{type}_{content} to keep them organized. This would look like:
sales_fact_orders
purchasing_fact_orders
finance_dim_regions
Now we have a naming scheme that describes what a table is in an easy-to-understand manner. But there’s one more aspect of naming that deserves consideration.
Be as precise as possible when naming tables
This is where your domain knowledge can make a big difference. This table is called orders, but the word orders may have multiple meanings in your organization. Sales orders and purchase orders are the most common. But what about change orders? Is an order an order if it hasn’t been invoiced yet? Precise terms are always better than generic ones. When in doubt, ask the domain experts. The pain of using a confusing or incorrect name lasts much longer than any temporary embarrassment from asking a clarifying question.
I spent several years at a company that used a table called data_science_enums. It turned out that it contained enums for their entire SaaS application, and the data science team was clearly annoyed at explaining it to every curious new engineer. The intern who created it may have done some other great work, but that terrible table name might be the only thing anyone remembers them for. Spending a few minutes upfront on a more accurate name would have spared a lot of confusion.
Let’s talk about databases and schemas
All these tables need to reside somewhere. The next step is implementing a strategy for schemas and databases. In the same way that table names should consistently describe the nature of their contents, databases and schemas should make searching for data intuitive. Finding data should never require a litany of messages between engineers.
For a typical analytics style system here’s a recommended design for databases and schemas:
Production - Only include data used by end users in this database. An engineer should never have to ask whether something in the Production database is being used. If it’s not being used in Production, get rid of it. If it’s Staging or archived data, it doesn’t belong here. A schema-per-business-area model works well. Sales data belongs in a SALES schema, where a table could simply be fact_orders. Other typical schemas might cover purchasing, inventory, or finance. In general, align schemas with the business structure. If users need purchasing reports, build a PURCHASING schema.
Staging - This is an optional middle layer that can be used to test changes before merging them into Production. It can also be used to let internal users outside the data team review updated tables before they go into Production. The schema design for staging should mirror the Production database.
Development - Use this area for development artifacts. Schemas here work differently than in Production: each engineer gets a personal schema, typically identified by their name. Purge development schemas regularly to save resources, clear outdated data, and avoid leaking sensitive information. Personal schemas may require a change in naming conventions: instead of PRODUCTION → SALES → fact_orders, a development schema might be DEVELOPMENT → BRIAN.OWEN → sales__fact_orders. Modern analytics frameworks such as dbt (https://www.getdbt.com) can apply these conventions automatically.
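In dbt, for example, this routing is typically done by overriding the generate_schema_name macro. The sketch below (Jinja-SQL, with the prod target name and schema names assumed) sends production runs to the model's configured schema and everything else to the developer's personal schema:

```sql
-- Simplified sketch, e.g. macros/generate_schema_name.sql in a dbt project:
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}   {# e.g. SALES #}
    {%- else -%}
        {{ target.schema }}               {# e.g. the developer's personal schema #}
    {%- endif -%}
{%- endmacro %}
```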
Raw - Raw databases should have one schema per source system and only contain data from external sources. For example: RAW → SALESFORCE → accounts
Backup - Data engineers hate deleting anything. And while data warehouse vendors are providing more and more tools for automated backup, it’s worth having a dedicated database to preserve historical data. With the relatively low cost of storage there’s no reason not to preserve critical data before making destructive changes. Most engineers leave these backups in the Production database. Don’t do this. It’s not production data so it doesn’t belong in the Production database. Backing up to Staging or Development also doesn’t work, as those databases should be occasionally purged. Using a dedicated database is the clearest way to handle this requirement. The schema design should match Production where applicable.
These five databases will handle almost everything in a typical analytics setup. Still, there are a few caveats to watch for before you implement.
Dealing with widely shared data
Not all data will fit neatly into the schema examples above, especially in dimension tables. In those cases, create a schema for widely shared data and store it there. For example, a distribution business might keep dim_product in this shared schema. Unlike other schemas where business groups can take ownership, data engineers should take a more active role in managing this widely shared data.
Avoid deeply nested transformations
With this design, most tables and views should be no more than four or five levels deep. The pattern for data flowing through the system should look something like:
raw_data_table → stage_data_table → intermediate_table → dim/fact table
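As a concrete sketch of that flow (all names hypothetical), each layer does one job:

```sql
-- Staging: light cleanup of the raw source, often a view
CREATE VIEW stage_sales_orders AS
SELECT order_id,
       CAST(amount AS DECIMAL(12, 2)) AS order_amount,
       CAST(created_at AS DATE)       AS order_date
FROM raw_sales_orders;

-- Intermediate: heavier business logic, usually a physical table
CREATE TABLE int_sales_orders AS
SELECT order_id, order_amount, order_date
FROM stage_sales_orders
WHERE order_amount > 0;

-- Fact: the table end users actually query
CREATE TABLE fact_sales_orders AS
SELECT order_id, order_amount, order_date
FROM int_sales_orders;
```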
If you need more complex data analysis, use the most upstream table that contains the required data. Continually adding new fact and intermediate tables downstream can create heavy demands upstream. Having dozens of upstream requirements can slow refreshes, make data quality hard to diagnose, and complicate development unnecessarily.
Unfortunately, deep nesting of transformations is a common problem. They typically start when a new transformation needs data that only exists in another fact table. So, the engineer simply uses the fact table by joining to it, creating an upstream dependency. Then another engineer has the same issue, does the same thing, and creates yet another upstream dependency. These compromises can quickly grow into dozens of layers of dependencies. An otherwise pristine system can become a mess of unmaintainable spaghetti code if this problem is not addressed.
Rather than continually adding new dependencies, consolidate the shared logic in the most upstream intermediate or staging table that can hold it. In some cases, creating a new intermediate table is the best way to resolve the dependency issue. Joining to upstream tables, rather than to other fact tables, results in a flatter design and can save many dependency headaches later.
Put it to practice
Use these rules and you’ll have a flat, well described, and easy-to-understand system. Table usage becomes self-explanatory, and when it’s time to add new tables, everyone will know where they go and what to call them.
Don’t abandon structure for convenience. If the rules don’t apply all the time, they don’t apply at all. If you find something that doesn’t map neatly into this organizational structure, only then should you add new conventions, and only as absolutely needed. Plenty of teams start out organized, then they drown in exceptions. Don’t be one of them.
And if this doesn’t work, you could just put all the tables on your computer’s desktop and call them untitled.sql.
2025-09-20 00:26