In this lesson we will introduce Apache Kafka and cover some of the core concepts and use cases of the platform.
Kafka is a distributed data streaming platform which allows you to publish and consume data with high performance, scalability and reliability.
An example use case for Kafka might be a server calculating the latest prices of stocks on a stock exchange and distributing the price updates to thousands of mobile clients in real time. Kafka is the messaging technology commonly used to fulfill streaming use cases such as this, where speed, scalability and reliability are essential.
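To make the publish/consume model concrete before we meet the real API, here is a toy in-memory sketch (the `ToyTopic` class and its methods are invented for illustration; this is not Kafka's client API). The key idea it mimics is that a topic behaves like an append-only log, and each consumer keeps its own read position, so thousands of clients can read the same stream of price updates independently:

```python
# Toy illustration of the publish/consume model (NOT the Kafka API):
# a topic is an append-only log, and each consumer tracks its own
# position (offset), so many clients read the same data independently.

class ToyTopic:
    def __init__(self):
        self.log = []  # append-only sequence of records

    def publish(self, record):
        """A producer appends a record to the end of the log."""
        self.log.append(record)

    def consume(self, offset):
        """Return all records at or after `offset`, plus the next offset."""
        records = self.log[offset:]
        return records, offset + len(records)


prices = ToyTopic()
prices.publish({"symbol": "ACME", "price": 101.5})
prices.publish({"symbol": "ACME", "price": 102.0})

# Two independent "mobile clients", each starting from offset 0.
client_a_records, client_a_offset = prices.consume(0)
client_b_records, client_b_offset = prices.consume(0)

# Both clients see the full stream of price updates.
print(len(client_a_records), len(client_b_records))  # → 2 2
```

Because consumers track their own offsets rather than removing messages from a shared queue, adding another client does not affect the ones already reading; this is one of the architectural ideas we will return to in the next lesson.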
For those familiar with messaging technology, Kafka can be thought of in a similar way to traditional message brokers such as Tibco products or RabbitMQ. However, Kafka is far more scalable and performant, and it introduces some important architectural evolutions over previous generations of this technology, which we hope to bring to life in these lessons.
Kafka is very widely deployed and is by far the leading platform for data integration in industry today. It is therefore an important area of knowledge for aspiring Data Engineers, whichever part of the stack you work in.
Kafka is open source and completely free to download, modify and deploy.
There are, however, commercially supported and managed distributions and services from Confluent, the main commercial and technical driver behind the development of Kafka.
Confluent also provides Confluent Cloud, a fully managed Software-as-a-Service platform for Kafka, which avoids the need to configure and run your own broker clusters.
Though Kafka can be used in many data integration scenarios, some of the most common use cases include:
- Real Time Streaming e.g. Streaming real time updates from server processes to web or mobile client applications;
- SOA or Microservice Integration e.g. Integrating back-end services such as triggering an email when a customer places an order;
- Data Exchange e.g. Communicating data and events from one business to a partner to fulfil some business process;
- ETL - e.g. Taking data from a source to a destination data repository such as from your application into your Data Lake or Data Warehouse;
- Real Time BI & Analytics e.g. Calculating metrics and analytics that allow you to monitor the state of your business in real time.
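The last use case above, real-time BI and analytics, can be sketched in a few lines. This is a toy, stdlib-only illustration (the event shapes and the `update_metrics` helper are invented for this example, not part of Kafka): we consume order events one at a time and fold each into running business metrics, just as a stream-processing consumer would do while polling a topic:

```python
# Toy sketch of real-time analytics over an event stream (illustration
# only, not Kafka's API): fold each incoming order event into running
# metrics that describe the state of the business right now.

def update_metrics(metrics, event):
    """Incorporate a single order event into the running totals."""
    metrics["order_count"] += 1
    metrics["revenue"] += event["amount"]
    return metrics


# In a real deployment these events would arrive continuously
# from a Kafka topic; here we simulate a short stream.
events = [
    {"product": "book", "amount": 12.0},
    {"product": "pen", "amount": 3.0},
    {"product": "book", "amount": 12.0},
]

metrics = {"order_count": 0, "revenue": 0.0}
for event in events:  # with Kafka, this loop would poll the topic
    metrics = update_metrics(metrics, event)

print(metrics["order_count"], metrics["revenue"])  # → 3 27.0
```

The point of the design is that metrics are updated incrementally as each event arrives, rather than recomputed in a nightly batch, which is what makes the "real time" in real-time BI possible.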
Data integration scenarios like these occur across all industries: ecommerce, stock exchanges, IoT and online advertising are all likely to have business requirements in this sphere. Again, Kafka is appropriate for any situation where we need to distribute data reliably and with high performance.
In this lesson we learnt about Kafka and its potential use cases for streaming data with high performance, scalability and reliability.
We then discussed common use cases, including integration scenarios, ETL, and real-time BI and analytics.
In the next lesson, we will learn some of the central concepts and terminology associated with Kafka.