In this lesson we will:
- Introduce DBT, the leading tool for Data Transformation within the Modern Data Stack.
- Explain the value that DBT brings to data teams, and introduce some of the core concepts.
DBT stands for Data Build Tool. It is an open source tool that is increasingly being deployed within modern data stacks to carry out data transformations.
Data transformations involve taking some source data and changing it in some way, into some other format or derived structure and/or applying calculations to it. Example transformations include:
- Cleaning the data - e.g. removing duplicated data or changing it into a standardised format;
- Summarising and aggregating the data - e.g. turning individual order data into daily summaries;
- Combining and joining data - e.g. building a joined up view from multiple source datasets;
- Carrying out rudimentary statistical analysis - e.g. to highlight trends and enable insights into the data.
Of course, reports, dashboards and other data consumers could be built on top of raw and unprocessed data directly to turn it into the format they need. However, it usually makes sense to carry out transformations earlier in the process and closer to the data warehouse. This makes the data that is ultimately exposed to the business much cleaner, better structured and more controlled and consistent. This data preperation within the warehouse is the role that DBT plays.
Companies have been doing this type of Extract, Transform and Load (ETL) work around their data warehouse for many years, and it's a very mature field with established patterns and tooling. However, DBT's approach is really a huge step forward in modernising the process and making it much more efficient and robust as we will see throughout this course.
Before discussing DBT in more detail, it is worth considering some of the problems with the existing approach and ETL tooling. Some of the most command challenges include:
- ETL processes are often fragile, breaking easily for instance if presented with unexpected data;
- ETL specific tools are often GUI based and proprietary, requiring special skills and license fees;
- ETL processes are usually managed by centralised data teams which become a bottleneck and source of delay;
- ETL changes required by the business often end up being slow to implement which delays downstream consumers;
- The schemas within data warehouses are often strongly defined and controlled and the ETL scripts become tightly bound to these;
- The ETL approach does not account for technical best practices such as testability, traceability, modularity etc;
- Source data is not bought into the warehouse in it's raw and unprocessed format for later analysis. This data is therefore not easy for analysts to work with.
In short, existing approaches to ETL lack speed, quality, agility and flexibility. At a time when businesses are looking to achieve more with their data, better approaches are certainly needed. Fortunately, DBT is one such tool to enable this.
Where historically, ETL was managed outside of the Data Warehouse using some third party proprietary tool, the main architectural change which DBT makes is to move the transformation step inside the Data Warehouse, where it can be executed after the data has been loaded. The process and the acronym therefore changes from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform).
This two character change sounds simple, but it has many positive implications:
- Carrying out the transformations within the database is simpler as there is less external technology to implement, manage and learn;
- Transformations can be defined in open standard SQL rather than some proprietary or GUI based tool;
- SQL based transformations can be checked into source control, tested, and turned into reusable components making the process more rigorous and repeatable;
- The transformations will likely be faster to run because they benefit from the full power of the warehouse;
- We can make use of the full capability of the database including the security model, user defined functions and cloud elasticity when building and running the transformation code;
- Both source data and transformed data can be retained in the database for different use cases.
This shift from ETL to ELT has been made viable by industry developments such a cheaper cloud storage such that provided by AWS S3, and the separation of storage and compute provided by databases such as Snowflake. Indeed, though DBT can be ran against many legacy data warehouses, it is these modern data stacks are where ELT and DBT is finding it's natural home.
In addition to being a step change in how we do ETL, DBT is said to bring the practices of software developers to the data realm, implementing a controlled software development lifecycle and set of patterns. For example:
- The scripts which describe the DBT transformations are based on SQL and stored as simple text files and configuration files. These can be therefore be stored in a source code management system, included in branching strategies and code reviews, and versioned properly so we have repeatable builds and deployments;
- DBT scripts can be ran using a lightweight command line interface. This could happen on the developers desktop, the analysts desktop, or incorporated into CI/CD deployment pipelines and ran with something like Jenkins. They are not trapped in some proprietary tool;
- DBT incorporates testing and assertion frameworks to ensure that the transformation runs successfully and produces correct and consistent idempotent results on each run;
- DBT models can be chained together into pipelines with dependencies, meaning that one step can only run if the proceeding one is successful. Again, this can be incorporated into the CI/CD process for increased reliability and robustness as we promote changes through environments;
- DBT is aware of environments and profiles, meaning that we can easily execute our scripts against development, test and production environments in a controlled way. (Integrating with something like Snowflake zero copy clones really makes production realistic testing of data transformations a reality);
- In addition to the automation and SDLC benefits, DBT also supports more collaborative ways of working and breaks the dependency and bottleneck on central data teams. For instance, because DBT is based around SQL and simple configuration files, we can break out of the central data team, and allow different teams to have access to and potentially take ownership of their DBT scripts.
All in all, DBT is a great tool for teams that want to manage their data using high quality engineering practices following a similar journey as software engineers have done over recent years.
In the next lab we will begin using the DBT CLI and configure our first project.