When developing BI solutions, one of the things we constantly struggle with is how to create the best data models. Often enough, a Data Warehouse will start performing slower and slower over time, to the point where a faint burning smell drifts in from the server room. Most companies will as a result build a new, shinier Data Warehouse that is going to get rid of all the previous problems and is going to be one hundred percent compatible with the old solutions.
These projects often lead to failures, finger pointing, “I told you so” and a nice elaborate “lessons learned” document that puts the blame on an “act of god” and is ignored by everyone.
While we can talk at length about proper project management (and trust me, we will), let’s focus on the basis of these problems first: the Data Model itself.
Most traditional projects start with a structured approach of first analyzing, then designing and then executing. Inmon’s approach naturally follows from this way of working. In Inmon’s way of data modeling (let’s call it the Inmon Model or IM from now on) we try to model the data in much the same way as the business is structured. We follow departments and their closest colleagues, all leading up to a data model that resembles an organogram.
This works great if we’re looking at a small to medium sized company!
Not so great though when the company and thus the data model starts to grow…
So let’s summarize the benefits of this way of modeling:
- Easier for business users because model resembles the business
- Integral solution, so only one version of “the truth”
- Relatively efficient, due to few joins needed for most users
And the obvious down sides:
- Requires a near-prophetic vision during planning to keep it viable in the future
- Every change made is highly impactful, potentially even impossible
- The difference between the IT application model and the IM means a large amount of translation
Fortunately, there are other ways of modeling your data, for example the approach Kimball proposed: what we now call the “Dimensional Model” (I’m going to call it DM). In this model we place a much smaller emphasis on pre-planning and try to make the model as compact as possible.
The most common way of applying the DM is by implementing a Star Schema. This means that you have one large “Fact” Table in the middle of your data model, surrounded by a multitude of “Dimension” Tables. The dimension tables are the easiest, since they are literally the dimensions to your data. This means the tables can contain date ranges (day, week, month, or even time), but also things like product categories, price lists and even product colour descriptions.
The fact table consists of a huge number of rows that all use the information from the dimensions to describe an entity of data. For example, we might look at a bicycle shop keeping its bike sales in a Star Schema. The fact table links to the “product” table to get the right colour, the “date” table for the sales date, etc. Lastly it often adds a few extra columns of its own, such as a unique ID or some aggregations.
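To make this concrete, here is a minimal sketch of that bicycle-shop Star Schema using Python’s built-in sqlite3. All table and column names here are illustrative assumptions, not a standard:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Dimension tables: the descriptive context of the data.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, colour TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, day TEXT, week INTEGER, month TEXT);
-- Fact table: one row per sale, pointing at the dimensions.
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
""")

cur.execute("INSERT INTO dim_product VALUES (1, 'City Bike', 'red')")
cur.execute("INSERT INTO dim_date VALUES (1, '2023-04-01', 13, 'April')")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 1, 499.95)")

# A typical query: one join per dimension, never dimension-to-dimension.
row = cur.execute("""
    SELECT p.name, p.colour, d.day, f.amount
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date    d ON d.date_id    = f.date_id
""").fetchone()
print(row)  # ('City Bike', 'red', '2023-04-01', 499.95)
```

Note how every join radiates out from the fact table, which is exactly where the “star” shape and the efficiency come from.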
This is the shortest version of it, though very likely I’ll come back to a deeper dive at a later date.
Let’s look at a neat example of what a good functioning Star Schema could look like.
Even in the case of growing business requirements the Star Schema can often grow with it!
But not too much…
To combat this rapid growth, we can do something that we call “Snowflaking” which means that we take some of the dimensions that are related and link them to each other. This violates some of the major principles of the Star Schema (about which some people can be very passionate), but it can sometimes be a pragmatic solution.
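Sticking with the assumed bicycle-shop tables from before, snowflaking could look like this: the product category is split out of the product dimension into its own table, so dimensions now link to each other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Snowflaking: dim_product no longer stores the category text itself,
-- it references a second dimension table instead.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
""")

cur.execute("INSERT INTO dim_category VALUES (1, 'Bikes')")
cur.execute("INSERT INTO dim_product VALUES (1, 'City Bike', 1)")
cur.execute("INSERT INTO fact_sales VALUES (1, 1, 499.95)")

# The price of snowflaking: an extra hop per query.
row = cur.execute("""
    SELECT p.name, c.category, f.amount
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
""").fetchone()
print(row)  # ('City Bike', 'Bikes', 499.95)
```

The extra join is the trade-off: the model stays smaller and less redundant, but queries no longer hop straight from the fact table to everything they need.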
But even a snowflake has its limits…
Another way that people have tried to combat these growth issues is by supplying multiple parallel Star Schemas. This way every department has its own efficient Data Solution, with its own department-specific requirements and adjustments. This works great, until John from accounting and Katie from marketing realize that their respective star schemas produce different numbers due to all these different business rules.
So, let’s wrap this up again by starting with the benefits:
- Very limited design phase and very early first deliveries
- Easy to add to and very easy to change
- Very efficient due to the lack of multiple table connection paths
And what could possibly lead to some issues:
- Very difficult to scale from a certain point onward
- Multiple Models are advised, which removes the one version of “the truth”
- Model doesn’t enforce staying in line with business organization
Relatively new on the block is Dan Linstedt with his Data Vault modeling. While maybe the youngest of the bunch, Data Vault modeling tries to solve some of the problems that the other ways of modeling have, and it is actually quite good at some of them!
The Data Vault (or DV from now on) has some similarities with IM, but tries to approach things in a more agile way, just like DM. In a DV we have 3 kinds of tables, namely the Hub, Satellite (Sat) and Link tables. A Hub is the most basic data of a business process concept, by design only containing a business key and some technical columns. Think of a table named Customers with only a customer id, or a table named Orders with only an order key.
Now the Link tables are what makes this model work. The link tables (surprise surprise) provide the links between the different tables and only contain this relationship information. The thing that makes the DV so different from the IM, however, is that a relationship only goes through one path and often follows a natural, department-independent flow. For example, the Customer table is linked to the Orders table, which is linked to the Products table, which is linked to the Components table. This means I can’t be smart and link Components back to the Customer for a nice and easy shortcut!
Lastly, the Satellites are what provides all the detail information, and they are linked to a Hub table. Think of a table with the specifications of the product, or the shipping details of an order. There is no limit to how many Satellites we can add to a Hub; however, sometimes it can be easier to have 4 tables with 10 columns of relevant information than 8 tables with 5.
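The three table kinds can be sketched the same way as before. This is a heavily simplified illustration with assumed names (real Data Vaults typically add hashed keys and more technical columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Hubs: just the business key plus technical columns.
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_date TEXT, record_source TEXT);
CREATE TABLE hub_order    (order_key    TEXT PRIMARY KEY, load_date TEXT, record_source TEXT);
-- Link: nothing but the relationship between two hubs.
CREATE TABLE link_customer_order (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    order_key    TEXT REFERENCES hub_order(order_key),
    load_date    TEXT
);
-- Satellite: the descriptive detail, hanging off a hub.
CREATE TABLE sat_order_shipping (
    order_key  TEXT REFERENCES hub_order(order_key),
    address    TEXT,
    shipped_on TEXT,
    load_date  TEXT
);
""")

cur.execute("INSERT INTO hub_customer VALUES ('C-42', '2023-04-01', 'crm')")
cur.execute("INSERT INTO hub_order VALUES ('O-1001', '2023-04-01', 'webshop')")
cur.execute("INSERT INTO link_customer_order VALUES ('C-42', 'O-1001', '2023-04-01')")
cur.execute("INSERT INTO sat_order_shipping VALUES ('O-1001', '1 Bike Lane', '2023-04-03', '2023-04-03')")

# Reaching a customer's shipping details means walking the one path:
# hub -> link -> hub -> satellite. No shortcuts allowed.
row = cur.execute("""
    SELECT hc.customer_key, s.address
    FROM hub_customer hc
    JOIN link_customer_order l  ON l.customer_key = hc.customer_key
    JOIN hub_order          ho  ON ho.order_key   = l.order_key
    JOIN sat_order_shipping s   ON s.order_key    = ho.order_key
""").fetchone()
print(row)  # ('C-42', '1 Bike Lane')
```

Even in this tiny sketch, one simple question already costs three joins, which hints at the performance downside discussed below.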
A small and simple version of a DV could look something like this.
There is no reason to think small though!
Without getting too technical, this single possible join path leads to some performance difficulties.
So what is the positive side of this way of working:
- Integral solution, so only one version of “the truth”
- Easy to understand by following business processes rather than organization
- Works with all sizes and easy to adjust along the way
Then why isn’t this the holy grail:
- Getting your data requires very long paths, making queries slow and expensive
- Can’t easily pick a subset of the model
- Requires remodeling when business processes change
If your business is relatively small and it doesn’t change all that much over the years, you can easily go for Inmon’s approach. If you just need a cheap and easy data solution and you’re OK with different users having slightly different data, you can trust in the Kimball Model. Lastly, if your company is changing and you want to work together with the business in managing the data, you can trust in Linstedt’s solution.
Me? I like the Data Vault. I like that I can communicate with my non-data colleagues about my work without needing to dumb things down. I like that if I explain some basic querying, most business users can use the solution in its earliest phase. I’m okay with spending a bit more to make everything run quicker, and giving my non-tech colleagues the tools to help with Data Warehouse development saves me a bunch on hiring specialized consultants!