Lesson Overview

In this lesson we will:

  • Learn about the architecture of Apache Druid;
  • Explain how Druid can store data in cheaper, longer-term "Deep Storage";
  • Learn about some of the considerations when deploying Apache Druid.

Druid Architecture

Though Apache Druid is a powerful database, its architecture is slightly more complex than that of many alternatives.

An Apache Druid deployment consists of a set of processes, or microservices, working together in a coordinated fashion.

The advantage of this is that the components can be scaled independently. For instance, if you have a workload with very frequent ingestion but relatively few queries, you may wish to add capacity to the ingestion components while running fewer query services.

The six classes of service are:

  • Coordinator - Manages where data is stored across the cluster
  • Overlord - Assigns ingestion workloads to different processes in the cluster
  • Broker - Receives queries from clients, forwards them to the processes holding the relevant data, and merges the results
  • Router - Routes requests from clients to the appropriate endpoint in the Druid cluster (see the query example below)
  • Historical - Stores and serves the queryable segments of historical data
  • MiddleManager - Responsible for ingesting new data into the Druid database

In a small deployment, all of these processes could run on a single server. In reality, however, we would likely scale the deployment across multiple machines.

A common deployment pattern is to place the MiddleManager and Historical processes together on "Data" servers to handle ingestion and data serving, and the Broker and Router together on "Query" servers to handle querying.
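As an illustration of the query path, SQL queries can be sent to the Router over HTTP at its default port, 8888, on the /druid/v2/sql endpoint, and the Router will forward them on to a Broker. A minimal sketch of a request body, assuming the marketdata datasource used in the segment examples below:

{
  "query": "SELECT COUNT(*) AS row_count FROM marketdata"
}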

Segments

Druid stores its data in segment files which are organised by time. When we ingest data, we specify a segment granularity such as minute, hour or day. Our data is then organised into blocks of time sized according to the chosen granularity.

In the example below, we have three segments with day granularity: one for the 1st of January 2022, one for the 2nd, and one for the 3rd.

marketdata_2022-01-01/2022-01-02_v1_0
marketdata_2022-01-02/2022-01-03_v1_1
marketdata_2022-01-03/2022-01-04_v1_2
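The segment granularity is set in the ingestion spec. A minimal sketch of the relevant section, assuming Druid's native batch ingestion format and the marketdata datasource above (the interval is illustrative):

"granularitySpec": {
  "type": "uniform",
  "segmentGranularity": "day",
  "queryGranularity": "none",
  "intervals": ["2022-01-01/2022-01-04"]
}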

Deep Storage

Apache Druid gives us some control over where the data is physically stored.

By default, data is stored on the server's local storage. This would typically be a high-performance SSD, giving us the fastest queries but also the most expensive storage.

After some time has passed, we may then choose to age out the data so it is stored on a cheaper tier of local storage, such as spinning disk, on a separate set of Historical servers.
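This ageing out is typically controlled by retention rules applied by the Coordinator. A minimal sketch, assuming two Historical tiers named hot and cold (the tier names are assumptions; the rule types are standard Druid load rules, evaluated in order):

[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "hot": 2 } },
  { "type": "loadForever", "tieredReplicants": { "cold": 1 } }
]

Here, data from the last month is kept on the hot tier with two replicas, while everything older falls through to the cold tier.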

After that, we may age the data out to deep storage on Amazon S3, Azure Blob Storage or HDFS. These stores are typically slower, but are a much cheaper place to keep historical data which may not be accessed as frequently.
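Deep storage is configured in Druid's common runtime properties. A minimal sketch for S3, with a hypothetical bucket name (the property names come from the druid-s3-extensions extension):

# Load the S3 extension and point deep storage at a bucket
# (the bucket name is hypothetical)
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments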
