By: Bart Baesens, Seppe vanden Broucke
This QA first appeared in Data Science Briefings, the DataMiningApps newsletter. Also want to submit your question? Just Tweet us @DataMiningApps. Want to remain anonymous? Then send us a direct message and we’ll keep all your details private. Subscribe now for free if you want to be the first to receive our articles and stay up to data on data science news, or follow us @DataMiningApps.
You asked: What are surrogate keys? Why would you use them in a data warehouse instead of using the business keys from the operational systems?
Our answer:
Surrogate keys are typically meaningless integers used to connect the fact to the dimension tables of a data warehouse. There are various reasons why we cannot simply reuse our existing natural or business keys. First, business keys usually have a business meaning in the OLTP systems, such as social security number for Employee and VAT number for Company. Hence, they are tied to the business setting and requirements and if these would change (e.g., due to a merger or acquisition, a new legislation) then all tables using those keys need to be updated, which may be a resource intensive operation in a data warehouse environment, because not only the current state is stored, but also historical data. Surrogate keys essentially buffer the data warehouse from the operational environment by making it immune to any operational changes. They are used to relate the facts in the fact table to the appropriate rows in the dimension tables, with the business keys only occurring in the (much smaller) dimension tables to keep the link with the identifiers in the operational systems.
Furthermore, when compared to surrogate keys, business keys are usually bigger in size which will result in big indexes and slow down index traversal and, consequently, query execution time. Therefore, using surrogate keys will save space and improve performance. This is especially true for the fact table, since most of its attribute types are foreign keys. For example, if the data warehouse contains data about 20,000 customers, who on average made 15 purchases, then the fact table will contain about 300,000 (small) surrogate key values, whereas the dimension table will contain 20,000 (large) business key values in addition to the same number of surrogate key values. Without surrogate keys, the fact table would contain 300,000 business key values.
Next, business keys are also often re-used over longer periods of time. For example, prodnr ‘123abc’ may be a different product now than 5 years ago. Hence, they cannot be used as primary keys in a data warehouse storing multiple snapshots of the data across longer time periods. Finally, surrogate keys can also be successfully used to deal with slowly changing dimensions.