In this lesson we will introduce Clickhouse and explain some of the distinctive features which are making it one of the preferred data warehouse solutions for high performance analytical use cases.
Data Warehouses are large centralised databases which are typically used to combine data from multiple line of business applications and data sources. For instance, businesses often need to combine all of the data from sales, marketing, finance and HR into a centralised Data Warehouse for a joined-up view across the entire business.
After ingesting and organising this data, the Warehouse is then usually responsible for exposing it to business stakeholders by serving reports, dashboards, and interactive analytics by data analysts, usually through third party business intelligence tools such as Tableau, PowerBI, or Looker.
Data Warehouses are designed to ingest and store large volumes of data and to be able to serve these business intelligence workloads with high performance. This contrasts with transactional databases such as MySQL or Postgres which are designed for transactional workloads that require high performance when working with small datasets or single rows, but are not optimised for high performance over large data volumes. Sometimes, the terms OLAP (Online Analytical Processing) or OLTP (Online Transaction Processing) respectively.
Clickhouse is an open source, relational, OLAP Data Warehouse which has a focus on query speed and performance.
It was originally developed as an internal project by Yandex, a Russian search engine, to power an internal platform called Metrica. It was then released as open source in 2016, and became sponsored by a commercial entity Clickhouse Inc in 2021.
Though Clickhouse does fit into the definition of a Data Warehouse discussed above, it is perhaps best suited and more commonly found in situations where we have high volumes of low level "event" data and a requirement for performance. Example use cases with this type of data include IOT, Clickstream data, log files or real time market data. These are situations where we have huge volumes of raw data which we would like to analyse in near real-time.
Though it is of course possible to do impleement heavily normalised schemas, batch loads and lots of pre-aggregations and as we would find in traditional Data Warehouses, the Clickhouse approach and philosophy is more about providing enough power and performance to simply to query raw event data.
Though there are a number of analytical databases and data warehouses available in the market, Clickhouse is differentiated in the following ways:
Performance - Clickhouse is known for it's very fast performance. If you need to query and aggregate large volumes of structured relational data, Clickhouse will often be the fastest option available in both the open source and commercial markets;
Open Source - Clickhouse is open source, making it free to download, change and deploy;
Ease Of Deployment - Clickhouse is relatively easy to start with and manage. It is delivered as a single binary which can be started out of the box, configured easily and ran anywhere;
SQL Native - Clickhouse is fully based on ANSI SQL making it more familiar and easier to interact with through APIs and reporting tools. This is in contrast to other competing tools in this space such as Druid or Elastic which are primarily interacted with via a JSON HTTP API;
Clustering - Clickhouse has an elegant model for forming a cluster and distributing query work across the shard with maximmum parallelism. As with the single node deploy, this is also relatively easy to stand up and maintain.
In short, we have a very powerful, fast, open source database which is easy to administer. It's easy to see why Clickhouse is growing in popularity as companies are looking to achieve more with their data and deploying real-time analytical use cases.
In order to achieve such high performance, there are also a few downsides and limitations to Clickhouse:
- Limited Updates and Deletes - Support for updates and deletes of data within Clickhouse is rudimentary. We can allow old data to be removed automatically, but it is not designed for ad-hoc updates and deletions like a transactional database. Other OLAP data stores also have this limitation so this is not unique to Clickhouse;
- No Transactions - Clickhouse does not have the concept of transactions, meaning that data could end up in an inconsistent state if steps are not taken to protect against this;
- Some Management Overhead - You may find yourself comparing Clickhouse with an option such as Snowflake, BigQuery or Redshift. Compared to these, Clickhouse does need to be self-installed and self-managed in a traditional way. This said, Clickhouse Inc are currently focussed on building out the SaaS and IaaS solutions so this could change in future.
In this lesson we introduced Clickhouse and described it's distinguishing features and properties, especially it's emphasis on query performance.
We also considered some of the downsides and tradeoffs of using a database such as Clickhouse, though many of these come from OLAP data stores more generally and not specifically Cikchouse.
In the next lesson we will connect to Clickhouse on our training virtual machine.