Lesson Overview

In this lesson we will:

  • Introduce Apache Kafka;
  • Cover some of its common use cases;
  • Highlight its key features and properties;
  • Discuss the open source and commercial distributions and managed service options.

What Is Apache Kafka?

Apache Kafka is a data streaming platform which allows you to publish, distribute and consume streams of data with high performance, scalability and reliability.

An example use case for Kafka might be distributing the latest prices of stocks on a stock exchange to thousands of mobile application clients in real time. Kafka provides the data exchange and messaging capability for use cases like this where speed, scalability and reliability are essential.
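As a sketch of the idea, an event published to Kafka is simply a keyed record on a named topic, with the key and value carried as raw bytes. The snippet below illustrates this in plain Python (the field names and the `stock-prices` topic are hypothetical; a real client such as the confluent-kafka `Producer` would handle the actual network delivery):

```python
import json

def make_price_event(symbol: str, price: float) -> tuple[bytes, bytes]:
    """Build a (key, value) pair as Kafka sees it: both are raw bytes.
    Keying by stock symbol keeps all events for one stock together."""
    key = symbol.encode("utf-8")
    value = json.dumps({"symbol": symbol, "price": price}).encode("utf-8")
    return key, value

# A real producer would then send this to a topic, e.g.:
#   producer.produce("stock-prices", key=key, value=value)
key, value = make_price_event("ACME", 101.25)
```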

In some ways, Kafka can be thought of as an evolution of traditional message brokers such as TIBCO, IBM MQ or RabbitMQ. However, Kafka is much more scalable and performant than previous generations of messaging technology, and has some important architectural evolutions which we will cover in this course.

Kafka is very widely deployed and is by far the leading platform for streaming data integration in use in industry today. It is therefore an important area of knowledge for aspiring Data Engineers looking to build modern real-time data platforms.

Common Use Cases For Kafka

Though Kafka can be used for many diverse data requirements, some of the most common use cases include:

  • Real-Time Streaming - e.g. streaming real-time data from server processes to web or mobile client applications, or vice versa;
  • SOA or Microservice Integration - e.g. integrating services which need to exchange data or actions to complete some business process;
  • Data Exchange - e.g. communicating data between systems where multiple copies are required;
  • ETL - e.g. taking data from a source to a destination data repository, such as from your application into your Data Lake or Data Warehouse;
  • Real-Time BI & Analytics - e.g. calculating metrics and analytics that allow you to monitor the state of your business in real time.
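The ETL case, for instance, often amounts to a small loop that consumes raw events, reshapes them, and produces the result to a destination topic. The transformation step can be sketched as a pure function (the field names and topics here are hypothetical, not from any particular system):

```python
import json

def transform(raw: bytes) -> bytes:
    """Reshape a raw order event into the form a warehouse might expect:
    keep only the fields of interest and compute a derived total."""
    event = json.loads(raw)
    shaped = {
        "order_id": event["id"],
        "total": event["quantity"] * event["unit_price"],
    }
    return json.dumps(shaped).encode("utf-8")

# In a real pipeline this would sit between a consumer poll loop and a
# producer: for each message, produce transform(msg.value()) to a
# destination topic such as "orders-clean".
out = transform(b'{"id": 7, "quantity": 3, "unit_price": 2.5}')
```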

Data integration scenarios like these occur across all industries. For instance, e-commerce, stock exchanges, IoT and online advertising are all likely to have business requirements in this sphere and will likely be using Kafka to build their solutions.

Notable Features Of Kafka

Though there are many ways to exchange data between systems, Kafka offers a number of features and characteristics which together make it very compelling as a platform:

  • Real Time - Kafka can distribute messages from producer to consumers in real time, potentially immediately after the data is created at the source. This is referred to as streaming data, and contrasts with a lot of data technology which is based on infrequent batch exchange of data;
  • Performance - Kafka can move messages from source to destination with very low latency, typically in the order of milliseconds;
  • Scalability - Kafka can accept thousands of connections from publishers and consumers and handle them all in a very resource-efficient manner. Kafka can also be scaled by adding multiple servers into a cluster where extra scale is required;
  • Reliability - Kafka can be configured so that messages are never lost and are delivered exactly once. It also handles a range of failure scenarios, such as consumers and producers that temporarily crash and need to resume from where they left off;
  • Ordering - Kafka introduces semantics whereby we can ensure that events are processed and received in order. This can be an important property that you rely upon which could make applications that you develop simpler;
  • Audit - Kafka can provide a central point for auditing and recording data events. It can be configured to retain data for a set period, giving it properties similar to a database and providing a useful log of what actually happened;
  • Security - Kafka introduces additional security controls such as the ability to encrypt data in transit.
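The ordering guarantee above is worth a closer look: it holds per partition. A Kafka producer hashes each message key to choose a partition, so all messages with the same key land on the same partition and are consumed in the order they were produced. The routing can be sketched as below (purely illustrative: Kafka's default partitioner uses a murmur2 hash, and the partition count is an assumption here; `crc32` merely stands in as a deterministic hash):

```python
import zlib

NUM_PARTITIONS = 6  # hypothetical topic partition count

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition, as a producer
    does. Same key -> same partition -> ordering preserved per key."""
    return zlib.crc32(key) % num_partitions

# Every event for the same stock symbol routes to the same partition,
# so consumers see per-symbol events in the order they were produced.
p1 = partition_for(b"ACME")
p2 = partition_for(b"ACME")
```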

Though we could attempt to implement data exchange and distribution without Kafka, it would be very complex and require significant engineering to achieve a similar level of capability using a more bespoke approach.

Open Source vs Commercial Distribution

Apache Kafka is an open source platform and free to download, modify and deploy.

There are however commercially supported and managed distributions and services, such as those from Confluent, who are the main commercial and technical supporters behind the development of Kafka.

Of particular note is Confluent Cloud, which is a fully managed Software-as-a-Service platform for Kafka. Using Confluent Cloud avoids the need to configure and run your own Kafka clusters, letting you simply consume Kafka as a service with a consumption-based billing model.

Cloud providers such as AWS, and managed-service vendors such as Aiven, also provide fully managed Kafka services if you wish to avoid the effort and overhead of running your own cluster.
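Connecting a client to a managed service typically only requires a small configuration change rather than any code change. The sketch below shows the general shape of such a configuration (property names follow librdkafka/confluent-kafka conventions; the endpoint and credentials are placeholders that your provider would supply, not real values):

```python
# Illustrative only: the endpoint and credentials below are placeholders
# supplied by a managed Kafka provider (Confluent Cloud, Aiven, AWS, ...).
managed_kafka_conf = {
    "bootstrap.servers": "your-cluster.example.cloud:9092",
    "security.protocol": "SASL_SSL",   # authenticate and encrypt in transit
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<api-key>",
    "sasl.password": "<api-secret>",
}

# A real application would pass this dict to a client, e.g.:
#   from confluent_kafka import Producer
#   producer = Producer(managed_kafka_conf)
```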

Next Lesson

In the next lesson we will learn some of the central architectural concepts and terminology associated with Kafka.


© 2022 Timeflow Academy.