In this lesson we will:
- Introduce Clickhouse.
- Explain some of the differentiating and notable features.
What Is Clickhouse?
Clickhouse is an open source, relational OLAP data warehouse with a focus on high performance.
It was originally developed as an internal project by Yandex, a Russian search engine, to power an internal platform called Metrica. It was released as open source in 2016 and, since 2021, has been backed by a commercial entity, Clickhouse Inc.
Though Clickhouse does fit into the data warehousing category, it is perhaps best suited to, and more commonly found in, situations where we have high volumes of low-level, time-oriented "event" data and a requirement to analyse this data with very high performance.
Example use cases with this type of data include IoT, clickstream data, log files and real-time market data. These are situations where we have huge volumes of raw data which we would like to ingest rapidly and at scale, and analyse in near real-time.
Differentiating Features Of Clickhouse
Though there are a number of analytical databases and data warehouses available in the market, Clickhouse is differentiated in the following ways:
- Performance - Clickhouse is known for its high performance. If you need to query and aggregate large volumes of event-based or time series data, Clickhouse is often among the fastest options available across both the open source and commercial markets;
- Open Source - Clickhouse is open source, making it free to download, change and deploy;
- Ease Of Deployment - Clickhouse is relatively easy to start with and manage. It is delivered as a single binary which can be started out of the box, configured easily and run anywhere;
- SQL Native - Clickhouse is queried using a SQL dialect that stays close to ANSI SQL, making it more familiar and easier to interact with through APIs and reporting tools. This is in contrast to competing tools in this space such as Druid or Elasticsearch, which have primarily been interacted with via a JSON HTTP API.
In short, we have a very powerful, fast, open source database which is easy to administer. It's easy to see why Clickhouse is growing in popularity as companies look to achieve more with their data and deploy more real-time analytical use cases.
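As a rough illustration of the SQL Native point above, the sketch below runs a plain SQL aggregation over a handful of clickstream-style events. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it is not Clickhouse and says nothing about Clickhouse performance. The table name, columns and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical clickstream events: (event_time, user_id, event_type).
EVENTS = [
    ("2024-01-01 10:00:05", 1, "page_view"),
    ("2024-01-01 10:00:40", 2, "page_view"),
    ("2024-01-01 10:01:10", 1, "click"),
    ("2024-01-01 10:01:30", 3, "page_view"),
    ("2024-01-01 10:02:20", 2, "click"),
]

# Plain SQL: count events per minute and per event type.
# substr() keeps the first 16 characters, i.e. "YYYY-MM-DD HH:MM".
QUERY = """
SELECT substr(event_time, 1, 16) AS minute,
       event_type,
       COUNT(*) AS events
FROM events
GROUP BY minute, event_type
ORDER BY minute, event_type
"""


def events_per_minute(rows):
    """Load rows into an in-memory SQLite table and run the aggregation."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_time TEXT, user_id INTEGER, event_type TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in events_per_minute(EVENTS):
        print(row)
```

The point of the sketch is the shape of the query: a familiar SELECT with GROUP BY rather than a JSON request body. In Clickhouse itself the timestamp would more likely be stored as a DateTime column and bucketed with a date function such as toStartOfMinute, rather than with string slicing.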
Tradeoffs When Using Clickhouse
In order to achieve such high performance, there are also a few tradeoffs and limitations associated with Clickhouse:
- Limited Updates and Deletes - Support for updates and deletes of data within Clickhouse is rudimentary. Old data can be removed automatically once it expires, but Clickhouse is not designed for the ad-hoc updates and deletions that we would use a transactional database for;
- No Transactions - Clickhouse does not have the concept of transactions, meaning that data could end up in an inconsistent state if steps are not taken by the administrator to protect against concurrent access issues;
- Management Overhead - You may find yourself comparing Clickhouse with an option such as Snowflake, BigQuery or Redshift. Compared to these, Clickhouse does need to be self-installed and self-managed in a traditional way, though this is changing with the recent general availability of Clickhouse Cloud;
- Clustering and Sharding - Though Clickhouse is frequently deployed as a cluster, administering this is a more manual undertaking than with cloud-native databases, which can perform automatic clustering and rebalancing of data.
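To make the first tradeoff above more concrete, here is a minimal sketch of what automatic expiry and an ad-hoc delete can look like. The statements are held as Python strings and wrapped in requests for the Clickhouse HTTP interface. The events table, its columns, the 30 day retention period and the local server address are all assumptions made for illustration, and nothing is sent unless you uncomment the final call against a running server.

```python
import urllib.request

# Assumption: a local server exposing the HTTP interface on its default port.
CLICKHOUSE_URL = "http://localhost:8123/"

# Hypothetical table whose rows expire automatically 30 days after event_time.
CREATE_EVENTS = """
CREATE TABLE events (
    event_time DateTime,
    user_id    UInt64,
    event_type String
)
ENGINE = MergeTree
ORDER BY (event_type, event_time)
TTL event_time + INTERVAL 30 DAY
"""

# An ad-hoc delete is issued as an ALTER TABLE mutation, which rewrites the
# affected data in the background rather than removing a row in place.
DELETE_USER = "ALTER TABLE events DELETE WHERE user_id = 42"


def build_request(sql: str) -> urllib.request.Request:
    """Wrap a statement in a POST request for the HTTP interface."""
    return urllib.request.Request(
        CLICKHOUSE_URL, data=sql.strip().encode("utf-8"), method="POST"
    )


if __name__ == "__main__":
    for statement in (CREATE_EVENTS, DELETE_USER):
        request = build_request(statement)
        print(request.get_method(), request.full_url, len(request.data), "bytes")
        # With a server running, the statement could be sent with:
        # urllib.request.urlopen(request).read()
```

The TTL clause covers the "remove old data automatically" case, while the ALTER TABLE ... DELETE statement shows why occasional corrections are possible but frequent row-level changes are a poor fit.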
To be fair to Clickhouse, many of these tradeoffs are also found in other OLAP databases, which are optimised for certain analytical use cases at the expense of others. As is always the case in the data world, it is important to choose the right tool for your particular use case.