Druid For Data Engineers

Introduction To Apache Druid

Lesson #1

In this lesson we will:

  • Introduce Apache Druid;
  • Explain its unique and differentiating features.

What Is Apache Druid?

Apache Druid is a high-performance database that is well suited to real-time analytics workloads.

Initially developed by an adtech company before being open sourced, Druid is optimised for performance and scale. Its capabilities include near real-time ingestion of high-volume data streams and near real-time interactive analytics over the resulting large datasets.

Druid is a particularly good fit for time series and event-based data, such as data from IoT devices, clickstreams or financial markets.

Notable Features Of Druid

In a crowded database market, Apache Druid has broken through by meeting a number of requirements very well:

  • It offers very high performance. For instance, Druid is the database underlying Netflix's IoT monitoring solution, which reportedly ingests up to 200 million events per second;
  • This performance allows clients such as reporting front-ends to explore real-time datasets interactively as soon as data is ingested. This makes Druid ideal as a real-time operational tool where users need to understand what is happening right now;
  • As well as the focus on real-time workloads, Druid also supports analysing historical data, and has a model which allows you to run analytics over a combination of both historical and current data;
  • Long-term historical data can be stored on a cheaper storage tier, whereas more recent data can be held on high-performance servers or in memory for faster interactive querying. This reduces running costs;
  • Druid is open source, making it free to deploy and modify. Commercial support is also available, primarily from Imply, the main commercial and technical contributor to the project;
  • It is cloud native, allowing you to run it across a large cluster and add or remove capacity as your needs change. The cluster is designed to tolerate the failure of individual servers, although the level of resilience depends on how it is configured;
  • It is very easy to integrate with, providing both JSON-based HTTP APIs and a SQL access layer.
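
To make the last point concrete, the sketch below shows one way a client might send a SQL query to Druid over HTTP using only the Python standard library. It is a minimal illustration rather than a definitive client. The host and port assume a local quickstart cluster with the Router on port 8888, and the `clickstream` datasource and its `page` column are hypothetical names used only for this example.

```python
import json
import urllib.request

# Assumption: a local quickstart cluster whose Router listens on port 8888.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"


def build_sql_request(sql: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Wrap a SQL string in the JSON body sent to Druid's SQL HTTP endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql: str, url: str = DRUID_SQL_URL) -> list:
    """Send the query and return the decoded JSON rows (needs a running cluster)."""
    with urllib.request.urlopen(build_sql_request(sql, url)) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical datasource "clickstream" with a "page" column.
# __time is Druid's built-in timestamp column.
EXAMPLE_QUERY = """
SELECT TIME_FLOOR(__time, 'PT1M') AS minute_bucket, page, COUNT(*) AS events
FROM clickstream
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1, 2
ORDER BY events DESC
LIMIT 10
"""
```

Calling `run_sql(EXAMPLE_QUERY)` against a running cluster that contains such a datasource would return one JSON object per result row. Building the request separately from sending it keeps the payload easy to inspect without a live cluster.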

Druid is often described as sitting at the intersection of several classes of database, combining characteristics found in OLAP data warehouses, time series databases and search systems. This makes it a distinctive proposition in the database market.

Business Benefits Of Druid

From a business perspective, Druid provides a number of benefits:

  • Arm your people or your systems with up-to-the-minute information about what is happening in your business, in situations where time is of the essence;
  • Run complex analytics over either real-time or historical data, particularly in an interactive, exploratory environment;
  • Process huge volumes of real-time event data;
  • Work with time-oriented data, understanding what happened over time using time series analysis;
  • Move away from heavyweight proprietary databases to something open source, cloud native and easy to integrate with.

© 2023 Timeflow Academy. All rights reserved