Many data professionals are wondering about the difference between the Data Lake and the Data Warehouse. Some do not understand the meaning or the use of these two notions in the storage of data in Big Data. In this column, we will define the Data Warehouse, the Data Lake. Then we will show the position of each of these notions in the data valuation chain and finally we will present the notable differences between them so that you are able to identify Data Lake or Data Warehouse, the strategy which is more appropriate. for your Big Data project.
In essence, the activity of a company is multi-process. In other words, to achieve its management objective, the company needs to divide its main activity into several business processes, for example, the purchasing process, the sales process, the operating process, the human resources and many more. Each of these processes generates operational data which is most often captured either from a specialized application such as an ERP (Enterprise Resource Planning) such as Oracle, SAP, PeopleSoft, or through Excel spreadsheets (or flat files, csv, etc.). This results in a “data siloing” which prevents management from having a global vision of the company’s activity: the finance application captures finance data, the logistics application captures data generated by the Logistics process, the HR application captures the data generated by HR etc. As these applications are distinct, specific to each business process, the company ends up with several different databases.
The Data Warehouse: the central data collection point
The Data Warehouse is the first solution for integrating data from these different applications (or databases). Integration is an IT strategy that consists of transferring all the data produced by applications into a single directory so that the users and other company systems can access it. To put it simply, integrating applications means homogenizing and providing a unified view of the data they possess. As simple as this solution seems, it poses two major problems: first, integrating the data that comes from different applications is very complex, because a very strong coupling translated by point-to-point connections exists between these different applications. The applications are not completely independent, they are interfaced to exploit each other’s data as shown in the following figure.
Second, the integration of these applications requires making them all compatible with a large number of APIs or exchange formats, which is not always possible since the applications are often proprietary and therefore compatible with specific APIs. Beyond the problem of data silos, there is also the governance aspect of this data. Indeed, with the multiple uses that are made of data by users, come higher stakes, in particular security, legal, integration and compliance issues, both for the company and for the individual. We hear every day in the press or on the Internet of cases of companies having suffered the hacking of their computer system or their data due to a careless error of the employees.
Faced with these risks, companies need to control and control the use of data both inside and outside the company. This therefore requires defining data quality policies, data security policies, data acquisition policies, data compliance policies with current legislation (Sarbanes Oxley, Patriot Act, CNIL, GDPR, etc. ) and ensure data is used according to company policies and guidelines. Governance therefore necessarily implies the elimination of data silos in the organization, and a unified and homogeneous view of all the company’s data. It would therefore be necessary to leave the diagram of figure 1 for the diagram of the figure below.
All applications store their data in a single directory where this data is homogenized, which allows all applications to have access to it. This directory also allows other systems in the company to be able to use all the data they need without having to resort to the application that created it.
From then on, the Data Warehouse acts as a central repository and as a framework for standardizing all the company’s data. Its goal is to provide a unified and homogeneous vision of all the data of the company. It centralizes all the company’s operational data in an RDBMS and organizes it into business subjects, which are made available to the corresponding business users. Business topics are made available to the corresponding business users. For example, the finance subject groups all finance-related data, which is then made available to financial analysts.
Formally, the Data Warehouse, as a concept, was invented in 1980 by Bill Inmon. This formally defines it as: “a collection of subject-oriented, integrated, constant-variance and non-volatile data used for strategic decision-making”:
- A data collection: i.e. a central integration point for all the company’s operational data contained in the operational applications (ERP, business applications, Excel spreadsheets, CRM, SCM, flat files, csv, etc). This central point does not necessarily refer to the centralization of data in a specific physical location or in a specific physical machine, the idea is to succeed in unifying and connecting all the data of a company using a common repository;
- Subject-oriented: the data stored in the Data Warehouse is not grouped according to the needs of business processes, but according to subjects, for example a subject of customer, salesperson, production, locality, etc.;
- Integrated: since data comes from different sources or systems, and is therefore often structured and coded in different ways, the Data Warehouse integrates them to provide a uniform, consistent and transparent representation;
- At constant variance: the Data Warehouse stores data in the form of history and thus introduces the notion of time into data storage;
- Non-volatile: data serialized in the Data Warehouse is read-only, it cannot be modified or changed. The only possible operation in the Data Warehouse is adding data;
- Used for strategic decision-making: The Data Warehouse, with this storage approach, serves as the foundation for data analysis and effective decision-making. It helps answer strategic questions like, “What are the best-selling products in each region, and how do demographics impact those sales results?” What is the turnover achieved by product line in the previous three quarters? What explains the decline in turnover over the previous quarter? »
Thus, as a concept, the Data Warehouse serves as a point of integration of operational data and also a point of distribution of this data in the hands of business users. The idea of the Data Warehouse is:
- to be corporate (have a corporate perimeter), i.e. to be the point of arrival of all the data generated by the company and to be the starting point of all the applications for exploiting data in the company ;
- to be as resilient to change as possible: we do not update the data in a Data Warehouse, we only add to it (the variance of the data is the same);
- given that it has a corporate scope, it must store massive volumes of data in a very reasonable time;
- to be independent of processing techniques, and independent of computer technologies;
- to be stable. Its stability is based on the fact that in business, generally what changes the most are technologies and applications. This means that you can build a Data Warehouse relying exclusively on business processes, which are generally subject to standardization. Thus, we would have a Data Warehouse for CRM, for accounting, for finance, for commerce, for human resources, etc. This is possible because business processes are not subject to rapid change.
From Data Warehouse to Data Lake
Despite the robustness of the Data Warehouse, it is no longer adapted to the current context. Indeed, the Digital Age is characterized by the growth of different types of stored data assets. The stored data is no longer just structured ERP data, but data as diverse as web server activity logs, call center call logs, social media data that combines content text, images, audio and video, surveillance center videos, sensor data etc. All of this data is generated at very high speed and stored in unstructured formats. These new data sources create new pressures for rapid acquisition, absorption, and analysis while maintaining consistency across all existing data assets. Where in a structured environment, there were clear patterns for the acquisition, exchange, analysis and restitution of data, today the company must rethink its data management approaches to extract a usable signal from all new data sources.
In addition, there is an increasing demand for real-time integration of analytical results. There are more and more people with an expanding variety of roles who are customers of analytical results. The growth is especially noticeable in companies where business processes are augmented to fully integrate statistical models to optimize their performance.
As an example, a large retail company can monitor real-time sales of thousands of SKUs across hundreds of outlets, and minute-by-minute logs of sales. Delivering these massive volumes of data to a community of different users for simultaneous analysis opens up new insights and capabilities that have never existed before: enabling shoppers to appreciate buying trends in order to make more accurate decisions about products catalogue ; product specialists to consider alternate ways of packaging products together; inventory management professionals to increase traffic more easily in warehouses or stores; Pricing professionals to instantly adjust the prices of the different points of sale, among others. The use of data analysis techniques in the current context requires real-time access to data and therefore the ability of the storage system to deliver the data as quickly as possible. The Data Warehouse was designed for the storage of structured data and does not provide real-time access to this data.
As we said earlier, to solve these challenges, the most appropriate approach is to distribute data storage and parallelize their processing on a Hadoop cluster. The technologies used to implement the Data Warehouse do not allow distributed storage or parallelism of user requests. Thus, companies must migrate all their data to a Hadoop cluster to provide a unified processing framework and rapid access to data.
With the drop in data storage costs and the cost of computers, Hadoop can be considered as a unified point of access to data and HDFS presents itself as the most profitable option both in terms of performance and in terms of financial costs for data storage and processing. The HDFS can be used as a central repository to archive various data sources and formats, whether structured or unstructured. In the Data Warehouse, the data must be stored according to a multidimensional model and the analyzes that can be done on the data must be predefined.
However, the fact that HDFS does not require data to be stored according to a particular data model promotes the capture of new data sources and makes it more flexible for data analysis. This approach where Hadoop is used as the single point of access to all company data is called Data Lake. We are going to define the notion of Data Lake and you will see that this concept does not have the same meaning as that which is commonly held and which you may intuitively think of.
Data Lake Definition
The phrase “Data Lake” was coined by James Dixon, the CTO of Pentaho as of the publication date of this article. Pentaho is a Big Data solution software publisher. Like any concept that emerges, the Data Lake has been assigned a multitude of definitions, for the majority as erroneous as each other. No one knows a concept better than its author. We are therefore going to define the Data Lake not according to what we can intuitively think, but according to the way James Dixon defines it and as any definition is always given in a context, we are going to come back to the context in which he created it.
The majority of business applications are essentially computer implementations of business processes (or workflow applications) or machine states. These are CRMs, ERPs, SCMs (Supply Chain Management), call center software, finance applications, or other business applications. Real-world entities or objects like employees, customers, orders represented in these applications are stored as collections of attributes that define their state at a specific time. Attributes are commonly referred to as fields or columns. As attributes of these objects, we can cite the name of the customer, his postal address, the date of the order, the quantity purchased, and so on. It is the value of these attributes at the time they are consulted that is technically called machine state or state. We will come back to this notion of machine state later.
Machine states are very effective when it comes to answering questions regarding the state of things, in other words, the machine state allows for reporting analyses, which indicate the situation of the object at a particular moment or which indicate what happened. However, when it is necessary to do predictive analyzes or analyzes that provide a medium or long term vision of the object, it becomes more complicated. Normally, to be able to perform predictive analyses, it is necessary to trace the different states of the attributes in a history called a log. This log generally takes the form of classic relational tables or CSV files that list the history of changes in the state of the attributes of the object over time. In a Data Warehouse for example, the history of real world entities is collected in tables called fact tables. This way the log can be aggregated for a set of attributes of one or more system objects to get a view of them over time. However, history is not always available for all objects. In general, the business chooses one or more objects of interest and it is for these objects that the history is collected.
Thus, if you suddenly want to analyze the evolution of the attributes of an object which initially had no interest (and therefore for which the history has not been collected), it will simply be impossible. Unfortunately, this situation is very common with the Data Warehouse approach. Specifically, building data analytics applications in the Data Warehouse approach follows the following pattern: users submit a set of questions they want the Data Warehouse to answer, a copy of the data regarding the attributes needed to answer to these questions is extracted from the Data Warehouse to constitute a “Data Mart”, and it is from this Data Mart that they obtain the answers to their questions. Each Data Mart is the Data Warehouse of a business process and any data mining application is built on a Data Mart.
This is where the Data Lake approach comes in. The Data Lake solves this problem in the following way: by taking advantage of the falling cost of computers, all of the company’s data is stored in a centralized repository hosted on a cluster of convenient computers. Then, Data Warehouse Data Marts are powered to satisfy traditional demands and enable ad-hoc queries and reporting on data lake data for new questions. Thus, the idea of the Data Lake is to historicize all the objects of the company in order to allow all kinds of predictive and descriptive analyses. The Data Lake lifts the limitation of pre-establishing the types and models of data analysis that could be done and built on the Data Warehouse by logging only certain attributes. The Data Lake retains and logs all the attributes of all the objects of the company. Thus, it is not just a question of storing all the data of the company in Hadoop, the Data Lake is not a catch-all of data. In the Data Warehouse, archiving the data of objects that are not relevant for decision-making is costly both in financial and technological terms and useless (since they do not meet the needs of users at that precise moment) . Falling IT costs now make it possible to store everything cheaply at a higher level of depth and granularity to cover future data analysis needs.
As you have seen, the Data Lake substantially increases the potential for data analysis. However, although its architecture based on Hadoop as a single point of access to the data of the whole company seems ideal, it poses a problem in its implementation: Hadoop, or rather the HDFS on which is based Hadoop is a system batch, unsuitable for rapid ingestion of continuous data streams and write/read operations on these streams. This problem is not unique to Hadoop. Even the Data Warehouse does not provide direct and fast access to data. In addition, the data must be transformed according to the rules of normalization which guarantee the homogeneity of the data whether in the Data Lake or in the Data Warehouse. In both cases, to solve this problem, an intermediate layer is added between the Data Lake/Data Warehouse and the operational sources of data. In the Data Warehouse approach, this middle layer is called an ETL.
ETL, Extract Transform & Load
ETL, Extract Transform & Load is the process in the IT environment which is responsible for extracting data from operational sources (EXTRACT), transforming them so that they comply with the Data Warehouse homogenization rules ( TRANSFORM) and load them into the Data Warehouse (LOAD). It is usually compared to the kitchen of a restaurant, where the ingredients are transformed into tasty dishes before being served to the customer. The ETL operates as a Batch system and by default loads new data into the Data Warehouse every night. In some businesses, an additional component can be added to the system to improve the Data Warehouse data refresh rate, this component is called the ODS, Operational Data Store. The ODS is an operational database that is frequently updated with operational data, usually to meet the needs of urgent data analysis.
Unfortunately, despite its effectiveness, ETL cannot be used as a connection bridge in a Data Lake approach for two main reasons:
- Batch processing: ETL is a Batch process that loads data periodically, it cannot be used to load data that arrives in a continuous stream. In its very principle, the ETL is composed of a set of routines (functions and procedures) which transform the data, manage its quality and apply a series of normalizations to the data before loading it in batches into the Data Warehouse;
- Normalizations: the idea of the Data Lake is to provide the history of all the states of the different attributes of the entities handled by the company. The Data Warehouse requires data to be normalized through ETL to provide a consistent and unified view of all enterprise data. This homogenization leads in most cases to a loss of information necessary for Reporting and detrimental for predictive analyses. Moreover, it is because of this that statistical learning studies are not based on the data loaded into the Data Warehouse, but directly on the data from the operational source;
As part of the processing of data generated by streaming, for these limitations of the ETL, market players have replaced the ETL with streaming ingestion tools. Kafka and Flume are two examples of such tools.
To succeed in ingesting data in a continuous flow and delivering it efficiently in the Data Lake, ingestion tools rely on two particular elements: a particular data structure called the Log and a Publish/Subscribe messaging system.
ABOUT LONDON DATA CONSULTING (LDC)
We, at London Data Consulting (LDC), provide all sorts of Data Solutions. This includes Data Science (AI/ML/NLP), Data Engineer, Data Architecture, Data Analysis, CRM & Leads Generation, Business Intelligence and Cloud solutions (AWS/GCP/Azure).
For more information about our range of services, please visit: https://london-data-consulting.com/services
Interested in working for London Data Consulting, please visit our careers page on https://london-data-consulting.com/careers