The Importance of the Surrogate Key in a well design data model

A couple of years ago, when I started working in the data warehouse world, the company I was working  for hired a consultant firm to help us to start building our data warehouse. And, as part of the contract, the consultants would also deliver a set of dashboards and reports. When the first set of reports came to us, we decided to take a look on the code (after all the product being used by the company was completely new for us, and we thought it was a good way to learn more from the experts). After having checked the scripts that generate and maintained the employee dimension, we noticed that the dimension key was built concatenating the date when the record was created and the employee id, which indeed is not a good practice (shame on you experts).

But, why concatenate the creation date and source primary key is so bad? Well, query performance is the first reason that comes out. Due to different types of values are being used to generate the key, it forces you to use VARCHAR data types to accommodate all the values. Indexing these type of fields requires more space and also affects the query response time.

Second, this also leads to another bad practice "simplify" the control of changes across time. Instead, of checking which attributes have been modified during a day, and only insert one record when something has changed (Slowly Changing Dimension) some prefer the use of these composite keys stating that you have a snapshot of that record for that day so you don't need to worry about any additional logic. For instance, the ETL that maintained our employee dimension at that time, added a record per employee each day, now imagine the amount of records that we had in that dimension three months later.

To avoid these bad design practices, the use of the surrogate key is strongly recommended. A surrogate key consist in a numeric value with no additional meaning used to identify a record in a dimension. Because, numbers are used we get rid of the issue of query performance and also due to it's assigned during the ETL processing it has nothing to do with the original source system, which helps to keep the data warehouse isolated. Also, surrogate keys helps to keep tracking of any change in our dimension. Instead of having a record per day per employee, we only insert a new record when any of the employee attributes change, the main benefit  is the reduction in the amount of records in our dimension.

In summary, avoid the trap of creating composite keys, instead use a surrogate key to identify your records in your dimensions. 

No comments:

Post a Comment