Contributed by: Wilfried Lemahieu, Seppe vanden Broucke, Bart Baesens
This article first appeared in Data Science Briefings, the DataMiningApps newsletter. Subscribe now for free if you want to be the first to receive our feature articles, or follow us @DataMiningApps. Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail over at briefings@dataminingapps.com and let’s get in touch!
This article is based on our upcoming book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.
Many organizations nowadays are struggling with finding the appropriate data stores for their data. In this article, we zoom in on some key data structures to facilitate corporate decision making by means of business intelligence. More specifically, we discuss data warehouses, data marts, operational data stores and data lakes and indicate their differences and similarities.
A data warehouse was first formally defined by Bill Inmon as [1]:
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.”
Subject-oriented implies that the data is organized around subjects such as customers, products, sales, etc. The data warehouse is integrated in the sense that it integrates data from a variety of operational sources and a variety of formats such as RDBMSs, legacy DBMSs, flat files, etc. Time variant refers to the fact that the data warehouse essentially stores a time series of periodic snapshots. Operational data is always up-to-date and represents the most recent state of the data elements, whereas a data warehouse is not necessarily up-to-date but represents the state at some specific moment(s) in time. Non-volatile implies that the data is primarily read-only, and will thus not be frequently updated or deleted over time. Hence, the two most important types of data manipulation operations for a data warehouse are: data loading and data retrieval.
A data mart is a scaled down version of a data warehouse aimed at meeting the information needs of a homogeneous small group of end-users such as a department or business unit (e.g., marketing, finance, logistics, HR, etc.). It typically contains some form of aggregated data and is used as the primary source for report generation and analysis by this end-user group. There are various reasons for setting up data marts. First of all, they provide focused content such as finance, sales or accounting information, in a format tailored to the user group at hand. They also improve query performance by offloading complex queries, and therefore workload, from other data sources (e.g., a data warehouse). Data marts can be located closer to the end-users, hereby alleviating heavy network traffic and giving them more control. Finally, certain reporting tools assume predefined data structures which can be provided by a customized data mart. In order to denote the contrast with a data mart, a full blown data warehouse is often called an enterprise data warehouse (EDW) to emphasize the organization wide aspect.
An Operational Data Store (ODS) is another way of dealing with the disadvantage of data warehouses not containing up to date data. An ODS can be considered as a staging area that provides query facilities. A normal staging area is only meant for receiving the operational data from the transactional sources for the sake of transforming the data and loading it into the data warehouse. An ODS also offers this functionality, but in addition it can be queried directly. In this way, analysis tools that need data that is closer to real time, can query the ODS data as it is received from the respective source systems, before time consuming transformation and loading operations. The ODS then only provides access to the current, fine grained and non-aggregated data, which can be queried in an integrated manner without burdening the transactional systems. However, more complex analyses requiring high-volume historical and/or aggregated data are still conducted on the actual data warehouse.
The data lake concept became known as part of the big data and analytics trend. Although both data warehouses and data lakes are essentially data repositories, there are some clear differences as listed in the table below. A key distinguishing property of a data lake is that it stores raw data in its native format, which could be structured, unstructured or semi-structured. This makes data lakes fit for more exotic and ‘bulk’ data types that we generally do not find in data warehouses, such as social media feeds, clickstreams, server logs, sensor data, etc. A data lake collects data emanating from operational sources ‘as is’, often without knowing upfront which analyses will be performed on it, or even whether the data will ever be involved in analysis at all. For this reason, either no or only very limited transformations (formatting, cleansing, …) are performed on the data before it enters the data lake. Consequently, when the data is tapped from the data lake to be analyzed, quite a bit of processing will typically be required before it is fit for analysis. The data schema definitions are only determined when the data is read (schema-on-read) instead of when the data is loaded (schema-on-write) as is the case for a data warehouse. Storage costs for data lakes are also relatively low because most of the implementations are open-source solutions that can be easily installed on low-cost commodity hardware. Since a data warehouse assumes a predefined structure, it is less agile compared to a data lake which has no structure. Also, data warehouses have been around for quite some time already, which automatically implies that their security facilities are more mature. Finally, in terms of users, a data warehouse is targeted towards decision makers at middle and top management level, whereas a data lake requires a data scientist, which is a more specialized profile in terms of data handling and analysis.
Data Warehouse | Data Lake | |
Data | Structured | Often unstructured |
Processing | Schema-on-write | Schema-on-read |
Storage | Expensive | Low cost |
Transformation | Before entering the DW | Before analysis |
Agility | Low | High |
Security | Mature | Maturing |
Users | Decision makers | Data Scientists |
To summarize, in this article we highlighted the differences and similarities between data warehouses, data marts, operational data stores and data lakes. All these data structures clearly serve different purposes and user profiles and it is important to be aware of their differences in order to make the right investment decisions.
[1] W.H. Inmon, Building the Data Warehouse, 2nd edition, John Wiley & Sons, New York, 1996.
For more information, we are happy to refer to our book Principles of Database Management: The Practical Guide to Storing, Managing and Analyzing Big and Small Data, www.pdbmbook.com.
Do you also wish to contribute to Data Science Briefings? Shoot us an e-mail (just reply to this one) and let’s get in touch!