In this lesson we will provide a more in-depth overview of Snowflake's architecture.
The Snowflake architecture is an evolution of the traditional data warehouse, modernised to take advantage of the properties of the cloud.
Some of the notable features of this architecture include:
As discussed, Snowflake is a fully Software-as-a-Service (SaaS) platform. This means that there is nothing to deploy or manage on your own servers or in your own cloud accounts.
Historically, businesses have had concerns about placing their data in the cloud, and so have tended to build data warehousing infrastructure on premises. In recent years, they have been tentatively moving towards solutions such as Amazon Redshift or Google BigQuery hosted in cloud infrastructure. However, moving to a full Software-as-a-Service model is a step even beyond this, which Snowflake successfully brought to market.
Though Snowflake is a fully SaaS solution, behind the scenes it runs on one of the major cloud providers' infrastructure - AWS, Azure or GCP.
Though this hosting is somewhat abstracted away from the Snowflake user or administrator, it means that Snowflake can take advantage of the inherent properties of the cloud, such as its massive scalability, elasticity and consumption-based pricing model.
Separation of storage and compute is one of the most commonly cited benefits of Snowflake and one of the key things to understand about its architecture.
In a traditional database such as Oracle or MySQL, storage and compute were historically tied together. If we wanted more storage, we would add another node which both stored data locally and could be used to serve queries. In other words, storage and compute were deployed together in a tightly integrated way. This was due to the low-latency requirements of transactional systems and the slow or unreliable network connectivity to remote storage at the time.
Over the years, a few things have changed with regard to this picture. Firstly, compute and network performance have improved to the extent that we can now realistically store data across the network from the compute nodes in order to decouple the two. With data warehousing workloads such as Snowflake's, we can also tolerate some increased latency from reading over the network. This makes decoupling storage from compute viable.
What this separation means is that the storage and the compute can scale independently. For instance, we could have an extremely large dataset spanning petabytes of data and a very small data warehouse. On the flip side, we could have a very small dataset and a very large processing tier to meet the needs of our business. This processing tier could then grow and shrink multiple times throughout the day, independently of the storage. This allows businesses to completely right-size their databases, and save significant costs in doing so.
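As a sketch of what resizing compute independently of storage looks like in practice, the Snowflake SQL below resizes a warehouse on the fly (the warehouse name here is hypothetical). The stored data is untouched; only the processing tier changes size.

```sql
-- Scale compute up for a heavy processing window, without touching storage.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';

-- Scale it back down afterwards; the underlying dataset is unaffected.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';
```

Because billing is per-second while a warehouse runs, resizing like this is how the "grow and shrink multiple times throughout the day" pattern is realised.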
Traditionally, there would be one data warehouse process to which all users connect to issue their queries and analytics.
With Snowflake, we can create multiple virtual warehouses which all point at the single shared dataset.
A common deployment model is to give different business units such as marketing, finance or sales their own virtual warehouse. These can then be sized to suit their needs, and scaled up and down in line with their own usage patterns. For example, finance may need more horsepower during month-end processes.
The ability to start and stop multiple virtual warehouses, and have them billed on a consumption basis, is a very powerful and flexible idea.
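To make this concrete, the sketch below creates two per-department warehouses over the same shared data (the warehouse names and sizes are illustrative, not prescriptive). The auto-suspend and auto-resume settings are what make consumption-based billing work in practice: a suspended warehouse incurs no compute charges.

```sql
-- Hypothetical per-department warehouses, both querying the same shared dataset.
CREATE WAREHOUSE finance_wh
  WAREHOUSE_SIZE = 'MEDIUM'
  AUTO_SUSPEND   = 300       -- suspend after 5 minutes idle; no compute charges while suspended
  AUTO_RESUME    = TRUE;     -- wake automatically when the next query arrives

CREATE WAREHOUSE marketing_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND   = 60
  AUTO_RESUME    = TRUE;
```

Each warehouse can then be resized independently, so finance can be scaled up at month end without affecting marketing's costs or performance.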
This arrangement of separating the data out into a different tier which is then shared by multiple warehouses is referred to as a "Shared Disk" architecture. At the same time, the practice of running multiple independent virtual warehouses is sometimes referred to as a "Shared Nothing" architecture. This hybrid of shared disk and shared nothing is really the secret sauce of Snowflake which underpins its rapid adoption.
The Snowflake Architecture is best thought of in tiers like so:
At the bottom level, we have the storage tier. This is provided by the cloud provider, through a service such as AWS S3 or Azure Blob Storage. As a Snowflake user or administrator, you won't be directly interacting with these services, though it's definitely worth having an appreciation of how they work and the reliability guarantees they provide. As a Snowflake administrator, you may also be loading data from, or extracting data to, these cloud services.
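As an illustration of that last point, the sketch below loads files from an S3 bucket into a Snowflake table using an external stage. The bucket, stage, table names and credential placeholders are all hypothetical.

```sql
-- Define a stage pointing at an S3 bucket (names and credentials are placeholders).
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/exports/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Load CSV files from the stage into a table, skipping each file's header row.
COPY INTO sales
  FROM @my_s3_stage
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```

Note that this is loading from your own bucket; the storage tier Snowflake itself uses for table data remains internal and is not something you address directly.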
At the second level, we have a cluster of compute nodes which are responsible for executing queries, data manipulations and other processing. This cluster is referred to as a Virtual Warehouse, and will typically serve one user or group of users. It is also possible, and common, to create multiple Virtual Warehouses, each of which may be long-lived, or be created for a short period of time to execute a particular task.
At the top tier, we have a set of cloud services which manage concerns such as authentication, security and access control, and metadata. These services are internal to Snowflake, but again it's worth understanding what they are responsible for, as occasionally they will touch on your day-to-day work.
In this lesson we learnt about Snowflake's architecture.
We introduced key characteristics such as the separation of storage and compute, and described how Snowflake benefits from running in cloud environments such as AWS or GCP as the underlying hosting tier.
We introduced a tiered architectural model of storage, compute and cloud services, which is a useful mental model when discussing and learning about Snowflake.