In this lesson we will:
- Introduce and illustrate the concept of a Data Pipeline;
- Introduce and illustrate the concept of a Graph or DAG;
- Explain the role of workflow orchestration tools such as Airflow and Dagster.
As we discussed in the previous lesson, the core task of a Data Engineer is to implement the systems which continuously move data from the myriad data sources around the business to destinations such as data warehouses or data lakes.
Though this sounds like a simple mission, there are a number of steps which have to take place:
- Extracting data from the source systems as batches or a continuous stream;
- Cleaning the data;
- Joining the data across datasets;
- Transforming the data ready for analysis or reporting;
- Layering in analytics, such as pre-computed aggregations and totals;
- Distributing it to the correct location such as a data warehouse or data lake ready for consumption.
Each of these steps could itself require multiple sub-steps. For instance, as part of our analytics pre-computation, we may choose to aggregate order values for each department before rolling up to a global order value total.
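The steps above can be sketched as plain Python functions chained together. This is only an illustration; the function names and data shapes are invented for the example, including the department-level aggregation rolling up to a global total:

```python
# A minimal sketch of the pipeline steps as plain Python functions.
# Names and data shapes are illustrative, not from any real system.

def extract():
    # Pretend we pulled a batch of raw order records from a source system.
    return [
        {"dept": "toys", "value": 10.0},
        {"dept": "toys", "value": 5.0},
        {"dept": "books", "value": None},  # a dirty record
        {"dept": "books", "value": 7.5},
    ]

def clean(rows):
    # Drop records with missing order values.
    return [r for r in rows if r["value"] is not None]

def aggregate_by_department(rows):
    # Pre-compute order totals per department...
    totals = {}
    for r in rows:
        totals[r["dept"]] = totals.get(r["dept"], 0.0) + r["value"]
    return totals

def global_total(dept_totals):
    # ...then roll up to a global order value total.
    return sum(dept_totals.values())

rows = clean(extract())
by_dept = aggregate_by_department(rows)
print(by_dept)                # {'toys': 15.0, 'books': 7.5}
print(global_total(by_dept))  # 22.5
```

Each function takes the previous step's output as its input, which is exactly the chain-of-operations shape the pipeline concept captures.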
Considering this, many tools in the Data Engineering space introduce the concept of a pipeline to capture the end-to-end chain of operations. This could be visualised in the following way:
This pipeline would have the following characteristics:
- If one step fails, the entire pipeline fails;
- It is only possible to start a given stage of the pipeline if the preceding step succeeds;
- If there is a failure in the pipeline, we can restart from the step that failed, without re-computing all of the previous steps;
- The pipeline could be triggered either on a schedule or in response to new data arriving at the source.
The pipeline divides the activities into cleanly separated steps which can be developed, tested and executed separately. It is therefore a good mental model, a very useful operational tool, and a useful practice for introducing quality into our data platforms.
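The characteristics above can be made concrete with a small sketch of a pipeline runner: it fails fast when a step raises, checkpoints each step's output, and can resume from the failed step without re-running earlier ones. The runner and the step functions are hypothetical, not from any real orchestrator:

```python
def run_pipeline(steps, data, start_at=0, checkpoints=None):
    """Run steps in order, recording each step's output so a retry
    can resume from the point of failure."""
    checkpoints = checkpoints if checkpoints is not None else {}
    for i, step in enumerate(steps[start_at:], start=start_at):
        data = step(data)      # if this raises, the whole run fails
        checkpoints[i] = data  # persist output so we can resume here
    return data, checkpoints

attempts = {"n": 0}

def flaky_double(x):
    # Fails on its first invocation, succeeds afterwards.
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("transient failure")
    return x * 2

steps = [lambda x: x + 1, flaky_double, lambda x: x - 3]

cp = {}
try:
    run_pipeline(steps, 10, checkpoints=cp)
except RuntimeError:
    pass  # step 1 failed; cp[0] holds step 0's output (11)

# Resume from the failed step using the checkpointed output of step 0.
result, _ = run_pipeline(steps, cp[0], start_at=1, checkpoints=cp)
print(result)  # 19
```

Real orchestrators persist these checkpoints durably (e.g. in a metadata database or object store) rather than in memory, but the resume-from-failure behaviour is the same idea.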
Directed Acyclic Graphs (DAGs)
In reality, our workflows are likely to be more complex than serial pipelines with a sequence of steps. For instance, there will likely be some tasks which can run in parallel, and we may need branching logic such as if and else statements as we execute our pipelines.
This gives us a slightly more complex model, closer to a graph:
Sometimes this is referred to as a Directed Acyclic Graph (DAG), which has the following characteristics:
- Execution of the graph moves from left to right as steps are completed;
- Some elements of the graph can be executed in parallel where it logically makes sense;
- The graph is acyclic, meaning execution can never loop back to an earlier step;
- Logic can be introduced into the graph, such as executing a task multiple times and introducing if/else checks.
The concept of the DAG shows up in many of the tools in Data Engineering, including orchestrators such as Airflow and Dagster, and transformation tools such as dbt which can be used to build models within a data warehouse. It is therefore a useful model for all Data Engineers to learn.
Tools For Managing Workflows
A data team of any size is likely to generate hundreds of such pipelines and DAGs, pulling data from multiple sources and continuously updating it for multiple consumers.
To support this, many teams will introduce a category of software known as an orchestrator or workflow manager to manage the end-to-end, ongoing execution of their pipelines and DAGs, which are sometimes referred to as workflows. Example tools in this space include Apache Airflow and Dagster.
The responsibility of these tools is to run the pipeline steps either on a schedule or when new data arrives, and to manage situations such as failures and retries. This gives Data Teams an overview of all of their executing DAGs.
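A pure-Python sketch can illustrate the orchestrator's core job, namely running a workflow's steps, retrying transient failures a limited number of times, and recording a status per workflow so the team gets that overview. A real tool such as Airflow or Dagster does far more (scheduling, backfills, UIs, distributed execution); everything named here is invented for the example:

```python
# Hypothetical mini-orchestrator: run steps with retries and
# record each workflow's final status in a shared registry.

def run_workflow(name, steps, runs, max_retries=2):
    for step in steps:
        for attempt in range(max_retries + 1):
            try:
                step()
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == max_retries:
                    runs[name] = "failed"  # retries exhausted
                    return
    runs[name] = "succeeded"

runs = {}
calls = {"n": 0}

def flaky():
    # Fails once, then succeeds - a transient failure the
    # orchestrator's retry policy absorbs.
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("transient failure")

run_workflow("orders_daily", [flaky, lambda: None], runs)
print(runs)  # {'orders_daily': 'succeeded'}
```

The `runs` registry stands in for the metadata database a real orchestrator uses to power its monitoring dashboards.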