In this lesson we will:
- Learn about the architecture of Apache Druid;
- Explain how Druid can store data in cheaper, longer term "Deep Storage";
- Learn about some of the considerations when deploying Apache Druid.
Druid Architecture
Though Apache Druid is a powerful database, its architecture is somewhat more complex than that of many alternatives.
An Apache Druid deployment consists of a set of processes, or microservices, working together in a co-ordinated fashion.
The advantage is that the components can be scaled independently. For instance, if you have a workload with very frequent ingestion but relatively few queries, you may wish to add more capacity to the ingestion components and run fewer query services.
The six classes of service are listed below (each runs as its own process; see the sketch after the list):
- Coordinator - Manages where data is stored in the cluster
- Overlord - Assigns workloads to different processes in the cluster
- Broker - Accepts queries from clients, forwards them to the Historical and ingestion processes that hold the relevant data, and merges the results
- Router - Routes requests from clients to the appropriate endpoint in the Druid cluster
- Historical - Stores segments of ingested data locally and serves queries over that historical data
- MiddleManager - Responsible for ingesting new data into the Druid database.
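Since each of these services is a separate process listening on its own port, a quick way to see the architecture in action is to probe each one's health endpoint. The sketch below assumes a local deployment on Druid's default ports (Coordinator 8081, Overlord 8090, Broker 8082, Historical 8083, MiddleManager 8091, Router 8888):

# /status/health returns "true" when the service on that port is up
for port in 8081 8090 8082 8083 8091 8888; do
  echo -n "port $port: "
  curl -s "http://localhost:$port/status/health"
  echo
done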
In a small deployment, all of these services could run on the same server. However, in reality we would likely scale the deployment across multiple machines.
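For example, the Druid distribution ships with quickstart scripts that launch all of the services on a single machine. A minimal sketch, run from the root of an unpacked distribution (the exact script name varies between releases):

# Newer releases provide a single launcher
./bin/start-druid
# Older releases ship a quickstart script instead
# ./bin/start-micro-quickstart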
A common deployment pattern is to place the MiddleManager and the Historical together on data servers to handle ingestion and storage, and the Broker and the Router together on query servers to handle querying.
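The clustered configuration shipped with the distribution follows this grouping, with one start script per class of server. A sketch, assuming the stock conf/druid/cluster configuration:

# On the master server (Coordinator + Overlord)
./bin/start-cluster-master-server
# On each data server (Historical + MiddleManager)
./bin/start-cluster-data-server
# On each query server (Broker + Router)
./bin/start-cluster-query-server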
Segments
Druid stores its data in segment files which are organised by time. When we ingest data, we specify a segment granularity such as minute, hour or day. Our data is then organised into segments sized according to the chosen granularity.
In the example below, we have three segments with day granularity: one for the 1st January 2022, one for the 2nd, and one for the 3rd.
marketdata_2022-01-01/2022-01-02_v1_0
marketdata_2022-01-02/2022-01-03_v1_1
marketdata_2022-01-03/2022-01-04_v1_2
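The granularity that produces segments like these is chosen in the ingestion spec's granularitySpec. Below is a trimmed sketch of the relevant part of a spec for the marketdata datasource above (a full spec would also need timestampSpec, dimensionsSpec, ioConfig and other sections):

"dataSchema": {
  "dataSource": "marketdata",
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "DAY",
    "queryGranularity": "NONE"
  }
}

With segmentGranularity set to DAY, each day of data lands in its own segment, exactly as in the listing above.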
Deep Storage
Apache Druid gives us some control over where the data is physically stored.
By default, data is stored on the server's local storage. This would typically be a high-performance SSD, giving us the most rapid querying but also the highest storage costs.
After some time has passed, we may then choose to age the data out to cheaper local storage, such as a tier of Historical servers backed by conventional hard disks.
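In Druid, this kind of ageing is typically expressed as Coordinator retention rules, which control which tier of Historical servers holds which time periods. A sketch, assuming we have configured a tier named hot (tier names are our own configuration choices; only _default_tier is built in):

[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "hot": 2 } },
  { "type": "loadForever", "tieredReplicants": { "_default_tier": 1 } }
]

Here the most recent month is kept on the hot tier with two replicas, while everything older falls back to a single replica on the default tier.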
After that, we may age the data out to deep storage such as Amazon S3, Azure Blob Storage or HDFS. These stores are typically slower, but are a cheaper place to keep historical data which may not be accessed as frequently.
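The deep storage backend itself is chosen in common.runtime.properties. A sketch for S3, assuming the druid-s3-extensions extension is included in the distribution (the bucket name here is hypothetical):

druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=my-druid-deep-storage
druid.storage.baseKey=druid/segments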