In this lesson we will:

  • Learn about the architecture of Apache Druid;
  • Explain how Druid can store data in cheaper, longer term "Deep Storage";
  • Learn about some of the considerations when deploying Apache Druid.

Druid Architecture

Though Apache Druid is a powerful database, its architecture is somewhat more complex than many alternatives.

An Apache Druid deployment consists of a set of processes or microservices working together in a coordinated fashion.

The advantage is that the components can be scaled independently. For instance, if you have very frequent ingestion but relatively few queries, you may wish to add more capacity to the ingestion components and run fewer query services.

The six classes of service are:

  • Coordinator - Manages where data is stored in the cluster, assigning segments to Historical processes and balancing them across the cluster
  • Overlord - Assigns ingestion workloads to the different ingestion processes in the cluster
  • Broker - Accepts queries from clients, forwards them to the processes holding the relevant data, and merges the results (see the query sketch after this list)
  • Router - Routes requests from clients to the appropriate endpoint in the Druid cluster
  • Historical - Stores and serves the queryable data, loading segments from deep storage onto local storage and answering queries against them
  • MiddleManager - Responsible for the ingestion process into the Druid database, running ingestion tasks that create new segments
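
As an illustration of the query path, the sketch below sends a Druid SQL query to the cluster over HTTP. It assumes a Router listening on its default port of 8888 and a marketdata datasource with symbol and price columns, all of which are illustrative; adjust them for your own cluster.

import requests  # pip install requests

# A minimal sketch of the query path: the client sends Druid SQL to the
# Router, which forwards it to a Broker for execution. The address,
# datasource and columns below are illustrative assumptions.
response = requests.post(
    "http://localhost:8888/druid/v2/sql",
    json={"query": "SELECT __time, symbol, price FROM marketdata LIMIT 5"},
)
response.raise_for_status()

# The SQL endpoint returns a JSON array of result rows by default.
for row in response.json():
    print(row)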

In a small deployment, all of these processes could run on the same server. In reality, however, we would likely scale the deployment across multiple machines.

A common deployment pattern is to place the Historical and MiddleManager processes together on "Data" servers to handle ingestion and storage, and the Broker and Router together on "Query" servers to handle querying, with the Coordinator and Overlord co-located on "Master" servers.

Segments

Druid stores its data in segment files which are organised by time. When we create an index, we specify a segment granularity such as minute, hour or day. Our data is then organised into segments covering time intervals of the chosen granularity.

In the example below, we have three daily segments: one for the 1st of January 2022, one for the 2nd, and one for the 3rd.

marketdata_2022-01-01/2022-01-02_v1_0
marketdata_2022-01-02/2022-01-03_v1_1
marketdata_2022-01-03/2022-01-04_v1_2
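
Granularity is set in the ingestion spec when data is loaded. Below is a minimal sketch, as a Python dictionary, of the granularitySpec fragment that would produce daily segments like those above; the rest of the ingestion spec is omitted, and the exact values are illustrative.

# The granularitySpec fragment of a native batch ingestion spec that would
# produce one segment interval per day, as in the example above. The
# surrounding spec (dimensions, ioConfig, tuningConfig) is omitted.
granularity_spec = {
    "type": "uniform",
    "segmentGranularity": "DAY",  # one segment interval per day of data
    "queryGranularity": "NONE",   # keep the original row timestamps
    "rollup": False,
}

# This object sits under dataSchema.granularitySpec in the spec submitted
# to the Overlord (POST /druid/indexer/v1/task).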

Deep Storage

Apache Druid gives us some control over where the data is physically stored.

By default, data is stored on the server's local storage. This would typically be a high-performance SSD, giving us the most rapid querying but also the most expensive storage.

After some time has passed, we may then choose to age the data out onto cheaper, slower local storage, such as spinning disks.

After that, we may age the data out to deep storage such as AWS S3, Azure Blob Storage or HDFS. These stores are typically slower, but are a much cheaper place to keep historical data which may not be accessed as frequently.
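
One way to implement this ageing out is with the Coordinator's retention rules, which control which time intervals stay loaded on Historical processes and which remain only in deep storage. Below is a minimal sketch assuming a Coordinator on its default port of 8081, a marketdata datasource and a one-month retention window, all of which are illustrative.

import requests  # pip install requests

# Retention rules are evaluated top to bottom for each segment interval.
rules = [
    # Keep the most recent month loaded on Historical processes (fast local storage).
    {"type": "loadByPeriod", "period": "P1M",
     "tieredReplicants": {"_default_tier": 2}},
    # Older intervals are unloaded from the Historicals but remain in deep storage.
    {"type": "dropForever"},
]

# Post the rules for the datasource to the Coordinator.
response = requests.post(
    "http://localhost:8081/druid/coordinator/v1/rules/marketdata",
    json=rules,
)
response.raise_for_status()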

