Lesson Overview

In this lesson we will cover:

  • Common patterns in stream processing platforms;
  • Considerations in build vs buy.

Growth In Streaming Data

Over the next few years, many businesses are going to be building real-time streaming data platforms based on data lakes and data warehouses.

The aim of these initiatives will be to improve customer service and business efficiency by processing and responding to data more rapidly.

Most of these event streaming architectures will follow a similar pattern and use very similar technologies:

  • Some mechanism for extracting data from source websites and applications, and turning this data into a stream of timestamped events;
  • Some streaming data engine, which will usually be Kafka or occasionally a cloud-managed service such as Kinesis;
  • A stream processing component, such as Kafka Streams, Flink or Spark Structured Streaming, to pre-process, analyse and aggregate the streaming data;
  • A data lake or warehouse to store the post-processed data and make it available for consumption;
  • Various means of accessing and analysing the data, including notebooks, application APIs and reporting front-ends optimised for real-time, event-based data.

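To make the stream processing step above concrete, here is a minimal pure-Python sketch of the kind of windowed aggregation that an engine such as Kafka Streams or Flink performs at scale. The event names and field layout are hypothetical, chosen only for illustration; a real pipeline would consume from Kafka topics rather than an in-memory list.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Event:
    key: str        # e.g. an event type or customer id (illustrative)
    timestamp: int  # epoch seconds
    value: float

def tumbling_window_counts(events, window_seconds):
    """Assign each event to a fixed-size (tumbling) window based on its
    timestamp, then count events per key within each window."""
    windows = defaultdict(lambda: defaultdict(int))
    for e in events:
        # Align the timestamp down to the start of its window.
        window_start = (e.timestamp // window_seconds) * window_seconds
        windows[window_start][e.key] += 1
    # Return plain dicts, ordered by window start time.
    return {w: dict(keys) for w, keys in sorted(windows.items())}

# Hypothetical sample stream: three events in the first minute, one in the second.
events = [
    Event("checkout", 0, 9.99),
    Event("checkout", 30, 4.50),
    Event("page_view", 45, 0.0),
    Event("checkout", 70, 19.99),
]

print(tumbling_window_counts(events, window_seconds=60))
# {0: {'checkout': 2, 'page_view': 1}, 60: {'checkout': 1}}
```

The same logic expressed in a streaming engine would run continuously and incrementally as events arrive, but the core idea is identical: bucket timestamped events into windows and aggregate per key.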

The benefits of these streaming platforms are huge, including faster time to insight, automation of business processes, lower storage costs, and simplifying the data estate away from batch ETL.

This said, the engineering effort behind a real-time data platform isn't trivial, and I think much of it will be reinventing the wheel, with architectures such as the one above being deployed over and over again. For this reason, we need to be careful before implementing yet another Kafka/data lake/Spark based data platform.

Here is how we advise companies to go about this journey:

  • Firstly, it makes sense where possible to work in the cloud. This way, we avoid the need to build and manage infrastructure, and benefit from consumption-based pricing and elastic capacity. This is largely a settled question by now.
  • Secondly, it makes sense to use cloud managed services and SaaS such as Confluent Cloud for messaging, analytical environments such as Snowflake and Databricks for data processing and storage, and SaaS reporting tools to avoid complexity there.
  • Next, many companies are implementing combinations of data lakes and data warehouses to serve different use cases. With the advent of the Data Lakehouse architecture, we can potentially avoid a huge duplication of technology and standardise on one platform.

With careful technology choice and embracing SaaS, significant engineering effort can be saved which can be redirected towards building analytics that actually move the needle for your business.



© 2022 Timeflow Academy.