In this lesson we will:
- Learn more about Data Lakes and their role as part of the Modern Data Stack;
- Compare Data Warehouse and Data Lake approaches;
- Introduce the concept of a Data Lake.
Data Warehouses, discussed in the previous lesson, have historically been the engine of enterprise Data and Analytics platforms. As discussed, the technology and practices around relational Data Warehousing are mature and battle tested, and still meet the vast majority of analytics challenges for the vast majority of businesses.
There are however situations and use cases where traditional Data Warehouses have historically had some downsides and limitations:
- Big Data - Date Warehouses can scale very well, but are not optimised for extremely high volumes of data such as log, machine or clickstream data;
- Unstructured Data - Data Warehouses are very good for structured relational table style data, but have not traditionally been as fully featured when working with semi-structured data such as XML, JSON, free text or video and audio data;
- Ad-Hoc Analysis - The Data Warehouse requires a schema to be designed up-front which is populated from your ingested data. This means that Data Teams have to consider in-advance how people will want to use and consume the data. This could impose limitations on the consumers of the data who want to be able to perform arbitrary ad-hoc analysis;
- Ownership Bottlenecks - The Data Warehouse tends to be owned by a centralised team, which can become a dependency or a bottleneck for Data Analysts and Data Scientists who want to get access to that data.
Modern Data Warehouse Products are tackling all of the above downsides and capability gaps in different ways, such that they are becoming less relevant criticisms in todays world. They are still relevant considerations however, even if designing a Modern Data Stack today.
Data Lakes grew as a concept from 2010 onwards, partly in response to some of the above criticisms, partly due to emerging technology, and partly due to increasing business demands to extract more value from increasingly large and complex datasets.
A Data Lake can be thought of as a place to store all of your raw, unstructured data where it is made available to your business users for their analytics use cases. As per the data warehouse, this Data can be sourced from across your business applications and data sources, and bought into a single repository which is organised and controlled.
In practice, you can think of the Data Lake like a file system, where we have a tree of folders which contain different data files in different formats, potentially including CSVs, JSON, text files, data extracts and audio and video files. The key is that all of this data is raw and unprocessed, usually extracted directly from the source system.
More often than not, this file system representing your Data Lake is hosted in the cloud, using some object store service such as AWS S3 or Azure Blob Store. These object stores are fast, reliable, cheap, globally distributed and easy to secure, so make a great foundation on which to build your data lake.
Where the Data Warehouse is very strong from a structure and governance point of view, the Data Lake doesn't traditionally have these features such as schemas, constraints, access controls, ability to roll back etc. The underlying object store will add some of these to a degree, but not to the same degree as a relational Data Warehouse.
Recognising this, vendors are releasing new tools and capabilites concept of the Data Lake is evolving to add more of these types of features:
- Binary formats such Delta Lake and Apache Iceberg, which add schemas and transactionality to data stored within the Data Lake;
- SQL engines such as Trino which allow us to query files in the Data Lake, directly using SQL or indirectly through e.g. Business Intelligence Tools;
Just as we are seeing Data Warehouses evolve to combat some of their downsides, we are also seeing Data Lakes evolve to combat theirs. The two sides are overlapping and becoming more unified as the Modern Data Stack evolves.
It is clear the Data Lakes and Data Warehouses both have some advantages. Data Lakes are great for storing raw, unstructured data and giving your people freedom to use it in arbitrary ways, whilst Data Warehouses are great for relational data and making it accessible to Data Analysts and Business Users.
With this in mind, many Data Teams go down the route of combining Data Lakes and Date Warehouses, implementing both technologies. Data Initiially flows into some data lake, where it is then consumed and ingested into the Data Warehouse.
Whilst this does offer the best of both worlds, it does have some significant downsides:
- Businesses need to impleemnt and manage both types of technology, which has significant TCO implications
- Two copies of the data need to be stored, which could potentially lead to conflicting information
Noticing this pattern, many vendors are attempting to unify the two technologies such that Data Lakes have more of the benefits and features of the Data Warehouse and vice versa. This is one of the key themes playing out in the industry right now.
For more than 30 years, Data Warehouses have been a central part of the business intelligence landscape. This pattern typically involved bringing structured data together from across the business into a centralised location for business intelligence reporting and analysis. For instance, banks often implemented large data warehouses to combine details from their marketing systems, accounting systems, and mortgage systems to build a single view of the customer. Data Warehouses were often SQL based and built on proprietary technology such as Oracle or SQL Server.
More recently, around 2010, the idea of the Data Lake emerged. Whereas Data Warehouses typically structure and organise their data into a set relational schema as it is ingested, the Data Lake pattern involves storing data which is unstructured or semi structured and leaving decisions about how the data will be processed and analysed until later when it is consumed. Data Lakes are also a better fit for data such as images, log files or other binary files and not just relational data. Finally, instead of being stored within a relational database as per the Warehouse, Lakes are typically stored on low cost Object Storage such as AWS S3 where the data is more open.
This evolution in approach has led to a situation where organisations have both Data Warehouses and Data Lakes within their organisation, and indeed multiple instances of each. Sometimes, the Lake is simply the more modern, strategic and widely used solution, whereas the Warehouse is legacy. In other instances, the two are integrated so that data is exchanged between the two, for instance by surfacing Lake data to Data Analysts through a relational Warehouse.
Recently, there has been considerable interest in combining the concepts into what is referred to a single Data Lakehouse. This involves taking the best of the Data Warehouse and Data Lake and delivering them together with one technology solution and one copy of the data. This is a fairly significant technology ask, but if it can be delivered, it has significant benefits:
- Only one technology platform needs to be deployed and managed, reducing cost, overhead and improving time to value;
- This shared data source can be used by Data Analysts and Data Scientists who can adopt common tooling and patterns;
- All data can be stored in the more modern lake structure, avoiding much of the painful ETL development, whilst also giving data analysts the organised and structured interface they need;
In practice, the Data Lakehouse pattern can be thought of as providing a layer over the top of the unstructured Lake, which makes it look like more a Data Warehouse. For instance, by adding the following capabilities over a Data Lake:
- Introducing the concept of relational tables over the unstructured data files;
- Exposing a SQL layer and query engine for querying and updating this data;
- Implementing database mechanisms such as constraints on top of the data;
- Implementing ACID transactions over the data lake.
Again, Databricks have the lead in developing this capability of integrating the Data Warehouse and Data Lake, but they are by no means the only vendor on this journey. Snowflake for instance is moving away from marketing itself as a Data Warehouse to becoming more of a Data Platform, whilst the Azure Synapse Analytics is also bringing this to life in an Azure native solution.
Data Lakehouse can sound like a marketing buzzword, but there is real substance behind it and I believe it is likely to be one of the big transformation stories for data strategy in the enterprise going forward.
In this article we considered Data Lakes and their role in the Modern Data Stack.
We considered how Data Lakes meet some of the historitcal shortcomings of Data Warehouses.
Finally, we discussed how many Data Teams are combining Data Lakes and Data Warehouses, and the downsides of doing this. We also looked at how the two worlds are inreasingly overlapping and unifying in their approach.