In this lesson we will:
- Introduce the concept of streaming data;
- Compare the traditional batch data approach with streaming data;
- Introduce some of the foundational concepts associated with streaming data.
About Streaming Data
Streaming data is data that is generated continuously and in high volumes. Common examples of this class of data include stock price updates, clickstream events, application logs, and data from IoT devices.
Streaming data is often, but not always, machine generated, and usually consists of a high volume of relatively small events that are published immediately after they are generated at the source.
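To make this concrete, a single clickstream event is typically a small, self-describing record. The sketch below shows what one might look like; the field names and structure are illustrative assumptions, not taken from any particular system:

```python
import json
import time

def make_click_event(user_id, page):
    """Build a small clickstream event.

    Field names here are hypothetical; real systems define
    their own event schemas.
    """
    return {
        "event_type": "page_view",
        "user_id": user_id,
        "page": page,
        "timestamp": time.time(),  # captured at the moment of generation
    }

# Each event is published on its own, immediately after it occurs.
event = make_click_event("user-42", "/pricing")
print(json.dumps(event))
```

Note how little data each individual event carries; the challenge comes from the rate at which millions of such events arrive.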
Though streaming data can be very valuable to businesses, processing and analysing it is challenging. Primarily, this is because the volume and velocity at which this data is generated are beyond the scalability limits of the tools that most data teams use today.
Beyond the volume of data, businesses often want to use their streaming data in sophisticated ways and in real time, as there is often some commercial or operational benefit to doing so. For instance, fraud detection, algorithmic trading and preventative maintenance are all examples where processing streaming data in real time has business benefit.
Considering the challenges and high demands around working with streaming data, new approaches, tools and platforms are required. Though these are emerging today, the field is still in its relative infancy.
Evolving From Batch To Streaming
Most data platforms deployed within businesses today are batch-based. This means that data is ingested and processed in batches of multiple records, typically on a schedule such as hourly or daily cycles.
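A minimal sketch of this batch approach is below: records accumulate over the day and are only processed when the scheduled job runs. The function and field names are hypothetical, and the "processing" is reduced to a simple sum:

```python
from datetime import date

def run_daily_batch(records):
    """Process an entire day's accumulated records in one pass,
    as a scheduled batch job would. No results exist until this runs."""
    total = sum(r["amount"] for r in records)
    return {
        "day": str(date.today()),
        "record_count": len(records),
        "total": total,
    }

# Records collected throughout the day, but untouched until the job fires.
days_records = [{"amount": 10}, {"amount": 25}, {"amount": 5}]
summary = run_daily_batch(days_records)
print(summary["record_count"], summary["total"])  # 3 40
```

The key property is that every record waits for the next scheduled run before it has any effect, which is exactly the delay discussed next.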
Though batch-based data exchange is simple, its major downside is that it implies a delay before data is processed and gets into the hands of business users.
Though this is acceptable in many situations, businesses increasingly want to process their data in real time for either operational use cases or to improve their customer experience.
Streaming data is the answer to this situation, whereby we move from periodic processing of data in batches towards continuously processing data as it is generated.
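The contrast with the batch model can be sketched as follows: instead of waiting for a scheduled job, each event updates the result the moment it arrives. The event source here is a stand-in generator; in practice it would be a consumer reading from a message broker:

```python
def event_stream():
    """Stand-in for a real event source (e.g. a message broker consumer).
    Yields events one at a time, as they arrive."""
    for amount in [10, 25, 5]:
        yield {"amount": amount}

def process_stream(stream):
    """Update a running total immediately per event, rather than
    waiting to process a whole batch on a schedule."""
    running_total = 0
    for event in stream:
        running_total += event["amount"]
        # An up-to-date result is available after every single event.
        print("running total:", running_total)
    return running_total

total = process_stream(event_stream())  # prints 10, then 35, then 40
```

The data and final result match the batch sketch; what changes is *when* results become available, shrinking the delay from hours to moments.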
The challenge, however, is that modernising from a batch-based architecture to a streaming platform is not simple. Most large businesses have a significant dependency on legacy data systems which have been designed around batch processing. They will likely have to implement a new generation of tools and infrastructure to successfully work with streaming data. And unfortunately, they will not necessarily have in-house experience of working with streaming data.
The journey from batch data processing to streaming architectures is likely to be a key theme for data teams in the coming years, and Data Engineers with experience in this field will likely be in high demand.