Course Overview
Clickhouse For Data Engineers

MergeTree Engine Family

Lesson #6

In this lesson we will:

  • Introduce the MergeTree table engine.

About MergeTree

When we create a table in Clickhouse we need to choose an engine which is responsible for storing and querying the data.

Different table engines are suitable for different use cases, data patterns and access patterns. It is therefore important to know broadly how they work and when to choose which.

This feature is fairly unique to Clickhouse, so it's a new concept to grasp and decision to make even for experienced data engineers and DBAs.

The MergeTree table engines are the most robust and heavily used engines of the choices available. There are actually a number of table engines in the same family, all that work slightly differently. What they have in common is that they accept inserts, but then work behind the scenes to organise and merge the rows to optimise the table data.

Types Of MergeTree

There are five table engines in the MergeTree family:

  • Merge Tree - The basic table which works by accepting data and then
  • Replacing Merge Tree - This will replace rows with more recent rows
  • Summing Merge Tree - This will combine rows and sum up data for instance we can aggregate by hour

Using The Merge Tree

The table engine is specified at the time of table creation using the ENGINE parameter. Let's go ahead and create three tables with three different engines which we will later user to explain the concept:

CREATE TABLE CUSTOMERS ENGINE=MergeTree()

CREATE TABLE CUSTOMERS ENGINE=ReplacingMergeTree()
CREATE TABLE CUSTOMERS ENGINE=SummingMergeTree()

Merge Tree Exploration

When you create data in a Clickhouse table, the rows are stored in segments.

create table

Replacing Merge Tree

In some tables, we are only interested in the most recent value of a particular data item. For instance, if we have a table of bank balances, there will be many instances where we only care about the current value and not the billions of historical values across the customer basis. In this situation, the Replacing Mergre Tree could help by dropping old values each time a new value is inserted for us.

We can begin by creating a table with the ReplacingMergeTree engine. We need to specify a PRIMARY KEY to describe which objects are replaced. To illustrate the point, we will use a compound primary key which combines two fields.

create table

We can insert data.

insert into orders values ( 1, 2, 3 );
insert into orders values ( 1, 2, 3 );
insert into orders values ( 1, 2, 3 );

A lot of the MergeTree operations work behind the scenes. Though there are no guarantees, issuing the optimize

optimize table orders

And with a quick check we can see that no data has been merged:

select * from orders

We will now insert a row which we expect to be replaced, using the same order_id and customer_id.

insert into orders values ( 1, 2, -99 );

Optimise the table again:

optimize table orders

And with one more check we will see that the latest order has been replaced:

select * from orders

Summing Merge Tree

The Summing Merge Tree works by adding up values. This is sometimes referred to as Rolling Up.

An example use case might be adding up the total value of orders for a given product

create table

We can insert data.

insert into orders values ( 1, 2, 3 );
insert into orders values ( 1, 2, 3 );
insert into orders values ( 1, 2, 3 );
Next Lesson:
06

Materialised Views

In this lesson we will introduce materialised views, and demonstrate how they are more powerful than materialsied views as implemented by other databases.

0h 15m



Continuous Delivery For Data Engineers

This site has been developed by the team behind Timeflow, an Open Source CI/CD platform designed for Data Engineers who use dbt as part of the Modern Data Stack. Our platform helps Data Engineers improve the quality, reliability and speed of their data transformation pipelines.

Join our mailing list for our latest insights on Data Engineering:

Timeflow Academy is the leading online, hands-on platform for learning about Data Engineering using the Modern Data Stack. Bought to you by Timeflow CI

© 2023 Timeflow Academy. All rights reserved