In this lesson we will:
- Discuss the challenges in building streaming solutions.
Why Is Working With Streaming Data Difficult?
As we discussed in the previous lesson, moving from the traditional batch approach towards a real time data streaming architecture is a challenging undertaking.
In this lesson, we will explain in more detail what these challenges are. The streaming technologies that we discuss in this course can be complex, and it is important to understand the problems that we are trying to solve with them.
Streaming platforms need to process and analyse high volumes of event data. Though a single stream can carry a high volume of events, there are likely to be multiple streams all generating data in parallel. An enterprise stream processing platform is therefore likely to need a very high degree of scalability to handle the volumes of data in flight and at rest.
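One common way to achieve this scalability is to partition each stream by a key, so that independent workers can process partitions in parallel. The sketch below is illustrative only (the function name and partition count are our own assumptions, not from any particular platform), but it shows the core idea used by most streaming systems: hash the key so that related events always land on the same worker.

```python
import hashlib

def assign_partition(event_key: str, num_partitions: int) -> int:
    """Map an event key to a partition by hashing, so events with the
    same key always route to the same partition (and the same worker)."""
    digest = hashlib.sha256(event_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Events for the same customer always route to the same partition,
# while different customers spread across the available workers.
p1 = assign_partition("customer-42", 8)
p2 = assign_partition("customer-42", 8)
assert p1 == p2
```

Keying by something like a customer ID preserves per-key ordering while still letting the platform scale out across partitions.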
The volume of events in a stream can also fluctuate over time, and may spike during peak hours. Streaming platforms therefore need the capability to scale up and down dynamically to accommodate these changing workloads.
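A minimal sketch of such a scaling decision, assuming we can observe the consumer backlog (the function and parameter names here are hypothetical): size the worker pool to the current lag, clamped between fixed limits.

```python
def desired_workers(queue_lag: int, events_per_worker: int,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Scale the worker count to the current backlog, clamped to limits.

    queue_lag: number of unprocessed events waiting in the stream.
    events_per_worker: backlog one worker can comfortably drain.
    """
    needed = -(-queue_lag // events_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, needed))

# Quiet period: scale down to the minimum.
desired_workers(0, 1000)       # 1 worker
# Peak-hour spike: scale out, but never beyond the configured ceiling.
desired_workers(100_000, 1000)  # capped at 20 workers
```

Real platforms make this decision continuously (and add smoothing to avoid thrashing), but the core loop is the same: measure lag, adjust capacity.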
In streaming scenarios, businesses usually derive the most benefit from responding to their event streams in real time. We therefore need to ingest, process and respond to streams of events with low latency in order to extract maximum value from the data.
Exactly Once Processing
When working with event streams, it is important never to lose a message, and never to send or process a message twice. We therefore need to build solutions with a high degree of reliability in how messages are processed, even if some component in the stack were to fail.
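One widely used building block for this is an idempotent consumer: because brokers typically redeliver messages after a failure, the processor tracks the IDs of events it has already applied and skips duplicates. The sketch below keeps the seen-ID set in memory for illustration; a production system would persist the IDs and the processing result atomically.

```python
class IdempotentProcessor:
    """Apply each event at most once, even if it is delivered several times."""

    def __init__(self):
        self.seen_ids = set()  # in a real system: a durable store
        self.total = 0

    def process(self, event_id: str, amount: int) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self.seen_ids:
            return False  # redelivered after a failure: skip it
        self.seen_ids.add(event_id)
        self.total += amount
        return True

proc = IdempotentProcessor()
proc.process("evt-1", 100)   # applied
proc.process("evt-1", 100)   # duplicate delivery: ignored, total stays 100
```

Combining at-least-once delivery with idempotent processing like this is how many systems achieve effectively exactly-once semantics.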
It is relatively simple to develop stateless processors which do things such as filter out, route, or add detail to events. However, the complexity grows when we want to look for historical patterns such as “3 failed credit card transactions in the last hour.” To do this, we need to process events by considering their past state, which adds significant complexity to the stack.
The notion of time becomes complex in event processing. Do we care about the time the event happened, the time it was received by the processor, or the time it was stored in the database? In most scenarios, event time is the natural choice, but then we need correct semantics to ensure that we are using the state of the world at the time in question when we come to process the event.
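To make the distinction concrete, here is a minimal sketch of event-time windowing (a hypothetical helper, not any specific framework's API): the event is assigned to a window based on when it happened, so a late-arriving event still lands in the correct window.

```python
def event_time_window(event_ts: int, window_seconds: int) -> tuple:
    """Assign an event to a tumbling window based on when it happened,
    not on when it arrived at the processor."""
    start = event_ts - (event_ts % window_seconds)
    return (start, start + window_seconds)

# A reading that happened at t=125 belongs to the [120, 180) window,
# even if it only arrives at the processor at t=200.
event_time_window(125, 60)  # -> (120, 180)
```

Real engines pair this with watermarks: a signal of how far event time has progressed, so the system knows when a window can safely be closed despite stragglers.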
It is important to maintain complete security around personally identifiable and commercially sensitive data. We need to encrypt all data both in flight and at rest as it moves through the various message queues and processors. This repeated encryption and decryption adds latency and increases the operational burden of managing the system.
If we needed to implement stream processing from scratch, it would be a very complex undertaking. Fortunately, many tools and platforms suitable for stream processing have been released and adopted by data teams. These will be discussed in greater detail in the next lesson.