In this lesson we will cover:
- Common patterns in stream processing platforms;
- Considerations in build vs buy.
Building A Streaming Data Platform
Over the coming years, many businesses are going to be building real-time streaming data platforms in order to deliver against their business objectives.
Components Of A Streaming Platform
Most of the streaming data platforms that businesses will deploy will follow a similar pattern and use very similar technologies:
- Some mechanism for extracting data from source websites and applications, and turning this data into a stream of timestamped events (sketched in the example after this list);
- Some streaming data engine, which will usually be Kafka or occasionally a cloud-managed service such as Kinesis;
- A stream processing component, such as Kafka Streams, Flink or Spark Structured Streaming, to pre-process, analyse and aggregate the streaming data;
- A data lake or warehouse to store the post-processed data and make it available for consumption;
- Various means of accessing and analysing the data, including notebooks, application APIs and reporting front-ends that are optimised for real-time, event-based data.
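To make the first two components concrete, here is a minimal sketch of event extraction in Python using the confluent-kafka client. The broker address, topic name and event schema are illustrative assumptions, not part of any particular platform:

```python
import json
import time

from confluent_kafka import Producer

# Assumed broker address and topic name; substitute your own.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def emit_page_view(user_id: str, url: str) -> None:
    """Turn a website interaction into a timestamped event on a Kafka topic."""
    event = {
        "event_type": "page_view",
        "user_id": user_id,
        "url": url,
        "timestamp_ms": int(time.time() * 1000),  # event time, captured at source
    }
    # Keying by user keeps a given user's events on one partition, preserving order.
    producer.produce("page_views", key=user_id, value=json.dumps(event))

emit_page_view("user-42", "/pricing")
producer.flush()  # block until outstanding events are delivered
```

In production you would typically also register an Avro or Protobuf schema with a schema registry and attach delivery callbacks, but the shape is the same: capture an interaction, stamp it with event time, and publish it to a topic.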
Benefits
The benefits of these streaming platforms are huge: faster time to insight, automation of business processes, lower storage costs, and a simpler data estate that moves away from batch ETL.
This said, the engineering effort behind a real-time data platform isn't trivial, and I think much of it will be reinventing the wheel, with architectures such as the above being deployed over and over again. For this reason, we need to be careful before implementing yet another Kafka/Data Lake/Spark-based data platform.
Here is how we advise companies to go about this journey:
- Firstly, it obviously makes sense where possible to work in the cloud. This way, we avoid the need to build and manage infrastructure, and benefit from consumption-based pricing and elastic capacity. This is largely a settled question by now.
- Secondly, it makes sense to use cloud-managed services and SaaS such as Confluent Cloud for messaging, analytical environments such as Snowflake and Databricks for data processing and storage, and SaaS reporting tools to avoid complexity there.
- Next, many companies are implementing combinations of data lakes and data warehouses to serve different use cases. With the advent of the Data Lakehouse architecture, we can potentially avoid a huge duplication of technology and standardise on one platform, as sketched in the example below.
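To illustrate the lakehouse pattern, here is a sketch using PySpark Structured Streaming that consumes the hypothetical page_views topic from the earlier example, computes a windowed aggregate, and writes it to a Delta table. It assumes the Kafka and Delta Lake connectors are available to Spark, and the storage paths are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("page-view-aggregation").getOrCreate()

# Schema of the JSON events produced upstream (illustrative).
schema = StructType([
    StructField("event_type", StringType()),
    StructField("user_id", StringType()),
    StructField("url", StringType()),
    StructField("timestamp_ms", LongType()),
])

# Read the raw stream from Kafka and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "page_views")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
    .select("e.*")
    .withColumn("event_time", (F.col("timestamp_ms") / 1000).cast("timestamp"))
)

# Page views per URL per minute, tolerating up to 10 minutes of late data.
counts = (
    events
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "1 minute"), "url")
    .count()
)

# Write the aggregates to a Delta table, where BI tools and notebooks can read them.
query = (
    counts.writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/lake/checkpoints/page_view_counts")
    .start("/lake/tables/page_view_counts")
)
query.awaitTermination()
```

The point of the pattern is that one Delta table serves the notebook, API and reporting consumers from the components list above, rather than maintaining a separate warehouse copy alongside the lake.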
With careful technology choices and a willingness to embrace SaaS, significant engineering effort can be saved and redirected towards building analytics that actually move the needle for your business.