Lesson Overview

In this lesson we will:

  • Learn about batch processing;
  • Learn about stream processing;
  • Cover the difficulties in moving from a batch to a streaming model.

Batch Data Exchange

There are many situations where businesses need to move data from one location to another. For example:

  • Copying data from one application into another, e.g. from an eCommerce system into a fulfilment system
  • Copying data into a repository, e.g. a Data Warehouse or a Data Lake, for later analysis
  • Copying data to a partner organisation, e.g. sending orders to a third-party warehouse that will fulfil them
  • Copying data into a location for backup, archive or audit purposes

Traditionally, businesses have done this using a batch model, whereby every hour, day or week, data is extracted from the source system and transferred to the destination in bulk.
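A minimal sketch of this batch model is shown below. The table, field names and schedule are all hypothetical; in practice the source would be a database query and the job would be triggered by a scheduler such as cron or Airflow.

```python
import csv
import datetime

# Hypothetical in-memory "orders table"; in practice this would be a database query.
ORDERS = [
    {"id": 1, "created": datetime.datetime(2022, 1, 1, 9, 30), "total": 25.0},
    {"id": 2, "created": datetime.datetime(2022, 1, 1, 10, 15), "total": 40.0},
]

def export_batch(orders, since, path):
    """Write all orders created after `since` to a CSV file for the downstream system."""
    batch = [o for o in orders if o["created"] > since]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "created", "total"])
        writer.writeheader()
        writer.writerows(batch)
    return len(batch)

# Run once per hour or day: only rows newer than the last export are shipped.
exported = export_batch(ORDERS, datetime.datetime(2022, 1, 1, 10, 0), "orders.csv")
```

The key characteristic is the delay: an order created just after an export runs will not reach the destination until the next scheduled run.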

From Batch To Streaming

Streaming technology involves moving data from source to destination as a continuous stream, typically immediately after it has been created.

By doing this, businesses gain near-instant visibility into their data and can respond to situations much earlier.
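The contrast with batch can be sketched as a simple consumer loop, where each event is handled the moment it arrives rather than accumulated for a later run. The in-process queue here is a stand-in for a real broker such as Kafka.

```python
import queue

# Stand-in for a message broker; a real system would consume from Kafka, etc.
events = queue.Queue()
processed = []

def handle(event):
    # React immediately, e.g. update a dashboard or trigger an alert.
    processed.append(event["id"])

# Producer side: events arrive one at a time as they are created.
for i in range(3):
    events.put({"id": i})
events.put(None)  # sentinel to end the loop in this sketch

# Consumer side: process continuously as events become available.
while True:
    event = events.get()
    if event is None:
        break
    handle(event)
```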

Challenges In Stream Processing

The idea of processing a stream of events to generate business value is a relatively simple one to understand. We also believe it is a compelling idea which can deliver genuine business value and improvements to the customer experience.

However, delivering this technically quickly becomes complex.

Some of the key challenges you will encounter include:

Exactly Once Processing

If we are processing a stream of events, it is important never to drop a message and never to send a message twice. In some scenarios, such as web analytics, we might be relaxed about the occasional dropped message, or even choose to optimise latency over correctness, but in most business scenarios, correctness is much more important. Exactly once processing is most difficult when a server or component in the system fails, because we need complete confidence over which messages have been processed when the system is restored.

Stateful Processing

It is relatively simple to develop stateless processors which do things such as filter, route, or add detail to messages. However, the complexity grows when we want to look for historical patterns such as “3 failed credit card transactions in the last hour.” To do this, we need to maintain a history of events, which needs to be stored in memory for fast lookups and persisted to disk to enable exactly once processing. We also need to get the right data to the right nodes that are performing lookups against this dataset.
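The "3 failed transactions in the last hour" pattern can be sketched with a sliding window, assuming each event carries a timestamp in seconds. The per-card history held by the processor is exactly the state that makes this harder than stateless filtering.

```python
from collections import defaultdict, deque

WINDOW = 3600  # one hour, in seconds
failures = defaultdict(deque)  # card id -> timestamps of recent failures

def on_failed_transaction(card_id, ts):
    """Record a failed transaction; return True if the card should be flagged."""
    window = failures[card_id]
    window.append(ts)
    # Evict failures that have fallen outside the one-hour window.
    while window and window[0] <= ts - WINDOW:
        window.popleft()
    return len(window) >= 3  # True => raise a fraud alert

# Four failures; the first falls out of the window by the time of the last.
alerts = [on_failed_transaction("card-1", t) for t in (0, 1200, 2400, 4000)]
```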


Low Latency

In a real-time complex event scenario, we typically need to respond as fast as possible. As well as optimising the path through the system, one of the main tools in our toolbox is to add parallelism to the message queues and processors. The problem is that parallelism opens the system up to race conditions and events being processed out of order. So as we parallelise, we have to use constraints, locks and integrity checks, which are the enemy of latency. Developing a low-latency, non-blocking stream processing solution which maintains correct semantics is not simple.
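The standard compromise, used by Kafka among others, is to partition the stream by a key: events for the same key (e.g. one account) always land on the same partition and keep their order, while different keys are processed in parallel. A minimal sketch:

```python
def partition_for(key, num_partitions):
    """Route a key deterministically to one partition."""
    return hash(key) % num_partitions

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

# Events for different accounts can be spread across partitions and workers,
# but both acct-1 events are appended to the same partition, in arrival order.
events = [("acct-1", "open"), ("acct-2", "open"), ("acct-1", "close")]
for key, action in events:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, action))

# Per-key ordering is preserved: "open" still precedes "close" for acct-1.
acct1 = [e for p in partitions for e in p if e[0] == "acct-1"]
```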

Time Semantics

The notion of time gets interesting in event processing. Do we care about the time the event happened, the time it was received by the processor, or the time it was stored in the database? In most scenarios, event time is the natural choice, but then we need correct semantics to ensure that we are using state at the time in question when we come to process the event, and not giving it a glimpse into the future. This becomes fiendishly complicated when we want to process out of order and late arriving events, because then we potentially need to go back and process old messages because the state at the time has actually moved.
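The difference between event time and processing time can be shown with a toy windowed count. Events are assigned to one-minute buckets by the time they happened, so a late arrival is still counted in the bucket it belongs to, which is precisely what a processing-time window would get wrong.

```python
from collections import defaultdict

counts = defaultdict(int)  # window start (event time, seconds) -> event count

def on_event(event_time):
    """Assign the event to a one-minute window based on when it happened."""
    window_start = (event_time // 60) * 60
    counts[window_start] += 1

# Arrival order (processing time): the event from t=30s arrives *after*
# the event from t=70s, i.e. it is a late, out-of-order arrival.
for event_time in [10, 70, 30]:
    on_event(event_time)

# Event-time windows are still correct: two events in [0, 60), one in [60, 120).
```

Real frameworks bound how long they wait for such stragglers using watermarks, which is where the "reprocess old messages" complexity described above comes from.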


Scalability

Event processing can be very bursty, perhaps providing a lot of data as a batch every hour or even at end of day. We therefore need the capability to scale up and down as necessary to accommodate these changing workloads.


Security

It is important to maintain complete security around personally identifiable and commercially sensitive data. We need to encrypt all data, in flight and at rest, as it moves through the various message queues and processors. This repeated encryption and decryption has an impact on latency and on operationally managing the system.

Data Volumes

Event processing can create enormous volumes of data, which are difficult to ingest and analyse. In the realm of stateful complex event processing, we also need to store some of this data in memory at the processing nodes.

Event Processing Frameworks

To make stream processing easier, various event processing frameworks and platforms have been created, including Kafka Streams, Apache Flink, Spark Structured Streaming and Google Cloud Dataflow.

These are individually powerful frameworks which help developers write and operate correct, performant stream processors, abstracting away much of the complexity described above.

Hands-On Training For The Modern Data Stack

Timeflow Academy is an online, hands-on platform for learning about Data Engineering and Modern Cloud-Native Database management using tools such as DBT, Snowflake, Kafka, Spark and Airflow...



© 2022 Timeflow Academy.