Clickhouse For Data Engineers

Introduction To Clickhouse

Lesson #1

In this lesson we will:

  • Introduce Clickhouse.
  • Explain some of the differentiating and notable features.

What Is Clickhouse?

Clickhouse is an open source, relational OLAP data warehouse with a focus on high performance.

It was originally developed as an internal project at Yandex, the Russian search engine company, to power its web analytics platform, Metrica. It was released as open source in 2016, and since 2021 has been backed by a commercial entity, Clickhouse Inc.

Though Clickhouse does fit into the data warehousing category, it is perhaps best suited to, and most commonly found in, situations where we have high volumes of low-level, time-oriented "event" data and a requirement to analyse this data with very high performance.

Example use cases with this type of data include IoT, clickstream data, log files or real-time market data. These are situations where we have huge volumes of raw data which we would like to ingest rapidly and at scale, and analyse in near real-time.
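
To make "time-oriented event data" more concrete, the sketch below counts a handful of clickstream events per minute in plain Python. The events and the views_per_minute helper are invented for illustration. In Clickhouse the same question would typically be a single SQL GROUP BY over a time bucket (for example using the toStartOfMinute function), run over billions of rows rather than five.

```python
from collections import Counter
from datetime import datetime

# Hypothetical clickstream events: (timestamp, page). Invented for illustration.
events = [
    ("2023-01-01 10:00:05", "/home"),
    ("2023-01-01 10:00:40", "/pricing"),
    ("2023-01-01 10:01:10", "/home"),
    ("2023-01-01 10:01:15", "/home"),
    ("2023-01-01 10:02:59", "/docs"),
]


def views_per_minute(rows):
    """Count events in each one-minute bucket, ordered by time."""
    counts = Counter()
    for ts, _page in rows:
        # Truncate each timestamp to the start of its minute.
        minute = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(second=0)
        counts[minute.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(sorted(counts.items()))


per_minute = views_per_minute(events)
print(per_minute)
```

This is the shape of most queries in the use cases above: filter by a time range, group into time buckets, and aggregate.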

Differentiating Features Of Clickhouse

Though there are a number of analytical databases and data warehouses available in the market, Clickhouse is differentiated in the following ways:

  • Performance - Clickhouse is known for its high performance. If you need to query and aggregate large volumes of event-based or time series data, Clickhouse is often among the fastest options available, whether open source or commercial;

  • Open Source - Clickhouse is open source, making it free to download, change and deploy;

  • Ease Of Deployment - Clickhouse is relatively easy to start with and manage. It is delivered as a single binary which can be started out of the box, configured easily and run anywhere;

  • SQL Native - Clickhouse is queried using a SQL dialect that stays close to ANSI SQL, making it familiar and easy to interact with through APIs and reporting tools. This is in contrast to competing tools in this space such as Druid or Elastic, which have traditionally been queried primarily through a JSON HTTP API.
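
As a rough illustration of the SQL-native point, the sketch below sends a plain SQL string to Clickhouse's HTTP interface using only the Python standard library. It assumes a server listening on localhost at the default HTTP port 8123. The helper names and the page_views table are hypothetical, and run_query will only work once a server is actually running.

```python
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_query_url(sql, host="localhost", port=8123):
    """Build a URL that passes the SQL text in the `query` parameter.

    Assumes the Clickhouse HTTP interface on its default port 8123.
    """
    params = urlencode({"query": sql}, quote_via=quote)
    return f"http://{host}:{port}/?{params}"


def run_query(sql):
    """Send the query and return the raw text response (needs a running server)."""
    with urlopen(build_query_url(sql)) as response:
        return response.read().decode("utf-8")


# The query itself is ordinary SQL; page_views is a hypothetical table.
top_pages_sql = (
    "SELECT page, count() AS views "
    "FROM page_views GROUP BY page ORDER BY views DESC LIMIT 10"
)

print(build_query_url("SELECT 1"))
```

The point is that the query is just SQL text, so anything that can produce a SQL string, from a reporting tool to a shell script, can talk to Clickhouse.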

In short, we have a very powerful, fast, open source database which is easy to administer. It's easy to see why Clickhouse is growing in popularity as companies look to achieve more with their data and deploy more real-time analytical use cases.

Tradeoffs When Using Clickhouse

In order to achieve such high performance, there are also a few tradeoffs and limitations associated with Clickhouse:

  • Limited Updates and Deletes - Support for updates and deletes of data within Clickhouse is rudimentary. Old data can be removed automatically, but Clickhouse is not designed for the ad-hoc updates and deletions we would expect from a transactional database;
  • No Transactions - Clickhouse does not have the concept of transactions, meaning that data could end up in an inconsistent state if steps are not taken by the administrator to protect against concurrent access issues;
  • Management Overhead - You may find yourself comparing Clickhouse with an option such as Snowflake, BigQuery or Redshift. Unlike these managed services, Clickhouse needs to be installed and managed by you in a traditional way, though this is changing with the recent general availability of Clickhouse Cloud;
  • Clustering and Sharding - Though Clickhouse is frequently deployed as a cluster, administering this is a more manual undertaking than with cloud-native databases, which can perform automatic clustering and rebalancing of data.
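
To see why rebalancing matters when a cluster grows, here is a simplified sketch of hash-based sharding in Python. It is not Clickhouse's actual routing logic, which depends on the sharding key and shard configuration you choose. The shard_for helper and the user IDs are invented for illustration.

```python
import zlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key and taking it modulo the shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


# Hypothetical sharding keys.
user_ids = [f"user-{n}" for n in range(1000)]

# Count the keys whose target shard changes when a fourth shard is added.
moved = sum(1 for u in user_ids if shard_for(u, 3) != shard_for(u, 4))

print(f"{moved} of {len(user_ids)} keys map to a different shard with 4 shards instead of 3")
```

With a scheme like this, adding a shard changes where many keys should live, but the rows already written stay where they are. Databases that rebalance automatically move that data for you, whereas here it becomes an administrative task.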

To be fair to Clickhouse, many of these tradeoffs are also found in other OLAP databases, which are optimised for certain analytical use cases at the expense of others. As is always the case in the data world, it is important to choose the right tool for your particular use case.
