In this lesson we will cover:

  • Common patterns in stream processing platforms;
  • Considerations in build vs buy.

Building A Streaming Data Platform

Over the coming years, many businesses will build real-time streaming data platforms in order to deliver against their business objectives.

Components Of A Streaming Platform

Most of the streaming data platforms that businesses deploy will follow a similar pattern and use similar technologies:

  • Some mechanism for extracting data from source websites and applications, and turning this data into a stream of timestamped events (see the sketch after this list);
  • Some streaming data engine, which will usually be Kafka or occasionally a cloud-managed service such as Kinesis;
  • A stream processing component, such as Kafka Streams, Flink or Spark Structured Streaming, to pre-process, analyse and aggregate the streaming data;
  • A data lake or warehouse to store the post-processed data and make it available for consumption;
  • Various means of accessing and analysing the data, including notebooks, application APIs and reporting front-ends optimised for real-time, event-based data.
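As a concrete illustration of the extraction step, here is a minimal sketch that wraps application events with a timestamp and publishes them to Kafka. It uses the confluent-kafka Python client; the broker address, topic name and event fields are illustrative assumptions rather than a prescribed schema.

```python
import json
import time

from confluent_kafka import Producer

# The broker address is a placeholder; in practice this would point at
# your Kafka cluster or a managed service such as Confluent Cloud.
producer = Producer({"bootstrap.servers": "localhost:9092"})


def emit_event(event_type: str, payload: dict) -> None:
    """Wrap an application event with a timestamp and publish it to Kafka."""
    event = {
        "type": event_type,
        "timestamp": int(time.time() * 1000),  # event time in epoch millis
        "payload": payload,
    }
    # "events" is a hypothetical topic name.
    producer.produce("events", value=json.dumps(event).encode("utf-8"))


emit_event("page_view", {"url": "/pricing", "user_id": "u-123"})
producer.flush()  # ensure buffered messages are delivered before exit
```

In a real platform this logic typically lives in a change-data-capture tool or an event-tracking SDK rather than hand-rolled producers, but the shape of the data is the same: timestamped, self-describing events on a topic.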

Benefits

The benefits of these streaming platforms are significant, including faster time to insight, automation of business processes, lower storage costs, and a simpler data estate with less reliance on batch ETL.

That said, the engineering effort behind a real-time data platform isn't trivial, and we think much of it amounts to reinventing the wheel, with architectures such as the above deployed over and over again. For this reason, we need to be careful before implementing yet another Kafka/Data Lake/Spark based data platform.

Here is how we advise companies to go about this journey:

  • Firstly, where possible it makes sense to work in the cloud. This way, we avoid the need to build and manage infrastructure, and benefit from consumption-based pricing and elastic capacity. This is largely a settled question by now.
  • Secondly, it makes sense to use cloud managed services and SaaS: Confluent Cloud for messaging, analytical environments such as Snowflake and Databricks for data processing and storage, and SaaS reporting tools, avoiding operational complexity at each layer.
  • Next, many companies are implementing combinations of data lakes and data warehouses to serve different use cases. With the advent of the Data Lakehouse architecture, we can potentially avoid a huge duplication of technology and standardise on one platform, as shown in the sketch below.
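To make the lakehouse point concrete, here is a hedged sketch of the processing step described earlier: a PySpark Structured Streaming job that reads the event stream from Kafka, aggregates it in one-minute windows, and appends the results to a Delta Lake table. The broker, topic, file paths and the delta-spark dependency are assumptions for illustration, not a prescribed setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("page-view-counts").getOrCreate()

# Read the raw event stream from Kafka (placeholder broker and topic).
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka values arrive as bytes; pull the URL out of the JSON payload.
parsed = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.payload.url").alias("url"),
    F.col("timestamp"),
)

# Count events per URL in one-minute tumbling windows, tolerating
# events that arrive up to five minutes late.
counts = (
    parsed
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "url")
    .count()
)

# Append finalised windows to a Delta table; both paths are placeholders.
query = (
    counts.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/page_view_counts")
    .start("/tmp/tables/page_view_counts")
)
query.awaitTermination()
```

Because the resulting Delta table is queryable from notebooks, BI tools and batch jobs alike, the streaming write and the downstream analytics can share one storage layer rather than duplicating data across a separate lake and warehouse.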

With careful technology choices and by embracing SaaS, significant engineering effort can be saved and redirected towards building analytics that actually move the needle for your business.

Next Lesson 05: Stream Processing vs Real Time Data Warehouses

In this lesson we will contrast stream processing with performing real-time analytics in a data warehouse.


