In this lesson we will learn more about Data Lakes and their role as part of the Modern Data Stack.
Data Warehouses, discussed in the previous lesson, have historically been the engine of enterprise Data and Analytics platforms. As discussed, the technology and practices around relational Data Warehousing are mature and battle tested, and still meet the vast majority of analytics challenges for the vast majority of businesses.
There are however situations and use cases where traditional Data Warehouses have historically had some downsides and limitations:
- Big Data - Data Warehouses can scale very well, but are not optimised for extremely high volumes of data such as log, machine or clickstream data;
- Unstructured Data - Data Warehouses are very good for structured, relational, table-style data, but have not traditionally been as fully featured when working with semi-structured data such as XML and JSON, or unstructured data such as free text, video and audio;
- Ad-Hoc Analysis - The Data Warehouse requires a schema to be designed up front, which is then populated from your ingested data. This means that Data Teams have to consider in advance how people will want to use and consume the data. This can limit consumers of the data who want to perform arbitrary ad-hoc analysis;
- Ownership Bottlenecks - The Data Warehouse tends to be owned by a centralised team, which can become a dependency or a bottleneck for Data Analysts and Data Scientists who want to get access to that data.
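The Ad-Hoc Analysis point can be illustrated with a minimal sketch. The event data, the `WAREHOUSE_COLUMNS` schema and both function names are hypothetical and exist only for illustration. The first function mimics a schema designed up front, which silently drops any field nobody planned for; the second keeps the raw records and decides what to extract at query time.

```python
import json

# Hypothetical raw clickstream events, as they might arrive from a source system.
RAW_EVENTS = """\
{"user": "a", "page": "/home", "ms": 120}
{"user": "b", "page": "/pricing", "ms": 340, "referrer": "newsletter"}
{"user": "a", "page": "/pricing", "ms": 95, "referrer": "search"}
"""

# Schema designed up front (an assumption for this sketch): only these columns are kept.
WAREHOUSE_COLUMNS = ("user", "page", "ms")


def load_with_fixed_schema(raw: str) -> list[dict]:
    """Keep only the pre-agreed columns; any unplanned field is dropped."""
    rows = []
    for line in raw.splitlines():
        event = json.loads(line)
        rows.append({col: event.get(col) for col in WAREHOUSE_COLUMNS})
    return rows


def count_by_field(raw: str, field: str) -> dict:
    """Read the raw records and group by whichever field the analyst asks for."""
    counts = {}
    for line in raw.splitlines():
        value = json.loads(line).get(field)
        if value is not None:
            counts[value] = counts.get(value, 0) + 1
    return counts
```

With the fixed schema, a question about `referrer` cannot be answered without a schema change, whereas `count_by_field(RAW_EVENTS, "referrer")` works directly on the raw records.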
Modern Data Warehouse products are tackling all of the above downsides and capability gaps in different ways, such that they are becoming less relevant criticisms in today's world. They are still relevant considerations, however, even when designing a Modern Data Stack today.
Data Lakes grew as a concept from 2010 onwards, partly in response to some of the above criticisms, partly due to emerging technology, and partly due to increasing business demands to extract more value from increasingly large and complex datasets.
A Data Lake can be thought of as a place to store all of your raw, unstructured data, where it is made available to your business users for their analytics use cases. As with the Data Warehouse, this data can be sourced from across your business applications and data sources, and brought into a single repository which is organised and controlled.
In practice, you can think of the Data Lake like a file system, where we have a tree of folders which contain different data files in different formats, potentially including CSVs, JSON, text files, data extracts and audio and video files. The key is that all of this data is raw and unprocessed, usually extracted directly from the source system.
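As a rough illustration of the file-system view, the sketch below writes a few raw files into a temporary folder tree and lists them. The `source/dataset/date` layout, the file names and the function names are assumptions chosen for the example, not a standard that a Data Lake must follow.

```python
import json
import tempfile
from pathlib import Path


def build_example_lake(root: Path) -> None:
    """Write a few raw files into a source/dataset/date folder tree (hypothetical layout)."""
    files = {
        "crm/customers/2024-01-01/customers.csv": "id,name\n1,Ada\n2,Grace\n",
        "web/clickstream/2024-01-01/events.json": json.dumps([{"user": 1, "page": "/home"}]),
        "support/tickets/2024-01-01/ticket-001.txt": "Customer cannot log in.\n",
    }
    for relative_path, content in files.items():
        target = root / relative_path
        target.parent.mkdir(parents=True, exist_ok=True)
        # Files are stored exactly as extracted, with no parsing or cleaning.
        target.write_text(content, encoding="utf-8")


def list_lake(root: Path) -> list[str]:
    """Return every file in the lake as a sorted, forward-slash relative path."""
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_example_lake(Path(tmp))
        for path in list_lake(Path(tmp)):
            print(path)
```

Note that the files sit side by side in different formats, and nothing in the folder tree itself says what columns or types they contain.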
More often than not, this file system representing your Data Lake is hosted in the cloud, using an object storage service such as Amazon S3 or Azure Blob Storage. These object stores are fast, reliable, cheap, globally distributed and easy to secure, so they make a great foundation on which to build your Data Lake.
Where the Data Warehouse is very strong from a structure and governance point of view, the Data Lake doesn't traditionally have features such as schemas, constraints, access controls or the ability to roll back. The underlying object store will provide some of these, but not to the same degree as a relational Data Warehouse.
Recognising this, vendors are releasing new tools and capabilities, and the concept of the Data Lake is evolving to add more of these types of features:
- Table formats such as Delta Lake and Apache Iceberg, which add schemas and transactionality to data stored within the Data Lake;
- SQL engines such as Trino, which allow us to query files in the Data Lake, either directly using SQL or indirectly through tools such as Business Intelligence Tools.
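To give a feel for running SQL over raw files, here is a toy sketch that uses Python's built-in `sqlite3` module as a stand-in for a SQL engine. It is not how Trino works: a real engine reads the files in place at scale, whereas this example copies a small CSV into memory. The CSV content, the `raw_orders` table name and the `query_raw_csv` function are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw CSV extract, as it might sit in the lake.
RAW_ORDERS_CSV = "order_id,country,amount\n1,UK,10.5\n2,FR,20.0\n3,UK,4.5\n"


def query_raw_csv(raw_csv: str, sql: str) -> list[tuple]:
    """Expose a raw CSV as an untyped table named raw_orders and run SQL over it."""
    reader = csv.reader(io.StringIO(raw_csv))
    header = next(reader)
    conn = sqlite3.connect(":memory:")
    try:
        columns = ", ".join(f'"{name}"' for name in header)
        placeholders = ", ".join("?" for _ in header)
        # No column types are declared: the values stay as the text found in the file.
        conn.execute(f"CREATE TABLE raw_orders ({columns})")
        conn.executemany(f"INSERT INTO raw_orders VALUES ({placeholders})", reader)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Types are applied at query time, since the file itself carries none.
TOTAL_BY_COUNTRY = (
    "SELECT country, SUM(CAST(amount AS REAL)) "
    "FROM raw_orders GROUP BY country ORDER BY country"
)
```

Calling `query_raw_csv(RAW_ORDERS_CSV, TOTAL_BY_COUNTRY)` returns one total per country. The cast inside the query shows why schemas matter: without a table format or catalogue describing the file, every query has to decide for itself how to interpret each column.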
Just as we are seeing Data Warehouses evolve to combat some of their downsides, we are also seeing Data Lakes evolve to combat theirs. The two sides are overlapping and becoming more unified as the Modern Data Stack evolves.
It is clear that Data Lakes and Data Warehouses both have advantages. Data Lakes are great for storing raw, unstructured data and giving your people freedom to use it in arbitrary ways, whilst Data Warehouses are great for relational data and making it accessible to Data Analysts and Business Users.
With this in mind, many Data Teams go down the route of combining Data Lakes and Data Warehouses, implementing both technologies. Data initially flows into a Data Lake, from where it is then ingested into the Data Warehouse.
Whilst this does offer the best of both worlds, it does have some significant downsides:
- Businesses need to implement and manage both types of technology, which has significant total cost of ownership (TCO) implications;
- Two copies of the data need to be stored, which could potentially lead to conflicting information.
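The second downside can be made concrete with a small sketch. Here an in-memory SQLite database stands in for the Data Warehouse and a list of dictionaries stands in for records held in the Data Lake; the `orders` table, its columns and both function names are assumptions for illustration only.

```python
import sqlite3


def load_into_warehouse(conn: sqlite3.Connection, lake_records: list[dict]) -> None:
    """Copy lake records into a typed, constrained warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, amount) VALUES (:order_id, :amount)",
        lake_records,
    )
    conn.commit()


def find_drift(conn: sqlite3.Connection, lake_records: list[dict]) -> list[int]:
    """Return order_ids whose presence or amount differs between lake and warehouse."""
    warehouse = dict(conn.execute("SELECT order_id, amount FROM orders"))
    lake = {record["order_id"]: record["amount"] for record in lake_records}
    all_ids = lake.keys() | warehouse.keys()
    return sorted(k for k in all_ids if lake.get(k) != warehouse.get(k))
```

Immediately after a load, `find_drift` returns an empty list. If a record in the lake is later corrected, or a new one arrives, the two copies disagree until the next load runs, which is the kind of conflicting information described above.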
Noticing this pattern, many vendors are attempting to unify the two technologies such that Data Lakes have more of the benefits and features of the Data Warehouse and vice versa. This is one of the key themes playing out in the industry right now.
In this lesson we considered Data Lakes and their role in the Modern Data Stack.
We considered how Data Lakes address some of the historical shortcomings of Data Warehouses.
Finally, we discussed how many Data Teams are combining Data Lakes and Data Warehouses, and the downsides of doing this. We also looked at how the two worlds are increasingly overlapping and unifying in their approach.