In this lesson we will:
- Introduce the concept of data orchestration;
- Introduce Dagster and explain how it supports data orchestration;
- Explain where Dagster fits in as part of the Modern Data Stack;
- Explain the most interesting features of the platform;
- Briefly compare Dagster with similar platforms including Apache Airflow.
What Is Data Orchestration?
Data Orchestration is the process of moving and manipulating data in order to meet the analytical and operational needs of the modern business. Typical orchestration activities include:
- Extracting data from various source systems and databases;
- Cleaning, de-duplicating and enhancing data;
- Applying analytics to the data to provide insights to the business;
- Copying data files to the correct locations and downstream systems.
These activities are likely to involve multiple systems and tools, and to span many different data sources. For this reason, orchestration can be thought of as an integration process that coordinates many different systems into a coherent end-to-end solution.
The orchestration tasks will typically need to be run on a periodic schedule, for instance hourly or daily, in order to process new data.
This is the second major responsibility of the orchestrator: managing the jobs so that data is produced in a robust and reliable way, whilst providing operational support to administrators.
The jobs in our data orchestration world are likely to have dependencies on other jobs.
For instance, when a new batch of data is delivered, perhaps we need to run a job to de-duplicate and add new fields into the dataset. Next, we might calculate a number of different analytics suites. After that, we may need to copy the resulting files into some line of business data warehouse ready for consumption.
This gives rise to the concept of a data pipeline, where jobs are executed one after the other and the pipeline only proceeds if each step in the pipeline is successful.
As well as executing jobs in a sequential pipeline, we may also have situations where some sections of the pipeline can run in parallel, and where the pipelines can branch depending on what is found in the data.
This gives rise to a Dependency Graph or what is sometimes referred to technically as a Directed Acyclic Graph or DAG of jobs.
Executing these DAGs in an efficient and robust way is the key capability provided by Dagster and similar tools.
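The DAG idea can be sketched in plain Python using the standard library's `graphlib`. The job names below (extract, deduplicate, and so on) are hypothetical, chosen to mirror the example pipeline above:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each job maps to the set of jobs it depends on.
jobs = {
    "extract": set(),
    "deduplicate": {"extract"},
    "analytics_a": {"deduplicate"},   # these two could run in parallel
    "analytics_b": {"deduplicate"},
    "load_warehouse": {"analytics_a", "analytics_b"},
}

# static_order() yields the jobs in an order that respects every dependency.
order = list(TopologicalSorter(jobs).static_order())
print(order)
```

A real orchestrator does far more than compute this ordering, of course: it schedules the runs, executes independent branches concurrently, and handles failures part-way through the graph.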
What Is Dagster?
Dagster is an orchestration and automation platform which makes writing and running this type of automation easier and more maintainable. Dagster will take care of the following:
- Running tasks in an automated fashion;
- Combining jobs into end to end pipelines and graphs;
- Running graphs of tasks in the most efficient way;
- Providing monitoring and alerting capabilities;
- Automatically retrying failed jobs;
- Providing a visual interface for monitoring the data orchestration processes across the business.
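To give a flavour of what "automatically retrying failed jobs" involves, here is a minimal, hypothetical retry loop in plain Python. Dagster provides this behaviour (plus backoff policies and alerting) out of the box, so the helper below is purely illustrative:

```python
import time

def run_with_retries(job, max_retries=3, delay_seconds=0.0):
    """Run a job, retrying up to max_retries times if it raises."""
    for attempt in range(1, max_retries + 1):
        try:
            return job()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted, surface the failure
            time.sleep(delay_seconds)  # wait before the next attempt

# Example: a flaky job that fails twice, then succeeds on the third attempt.
attempts = {"count": 0}

def flaky_job():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "done"

result = run_with_retries(flaky_job)
```

Hand-rolled wrappers like this are exactly the kind of bespoke scripting that an orchestration platform replaces.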
Why Use Dagster?
Data orchestration is a very common requirement within businesses that have any kind of data and analytics capability.
Traditionally, they have delivered this using a combination of proprietary ETL tooling and bespoke scripting. These solutions are often difficult and risky to change, require significant maintenance, and can lack robustness and reliability. They also tend to lack important features such as error handling, retry logic, monitoring and secure role-based access control.
Using a platform like Dagster for orchestration avoids all of the above issues, giving your team a much more powerful and reliable platform. This means that data teams can focus purely on their own domain rather than building bespoke automation tools.
Other benefits of Dagster include:
- Python logic - The automation tasks are defined using plain Python. This means that they are open, portable and easier to understand than say the logic embedded within a legacy ETL tool;
- Software Development Lifecycle - By moving from a proprietary tool to code, we can benefit from developer-like practices including version control and unit testing;
- Clean and reusable code - Dagster encourages clean code, where each task is separately defined and abstracted;
- Improved reliability - Dagster will take over responsibility for running our execution graphs, improving the reliability of your data delivery;
- Better audit and logging - Dagster will provide better visibility of what actually happened to help with debugging and audit requirements;
- Single Pane Of Glass - Dagster provides a very powerful administration GUI for full oversight of what is happening in all of the data orchestration workflows.
Where Does Dagster Fit Into The Modern Data Stack?
Many data teams are implementing Modern Data Stacks to provide their data and analytics capabilities.
Dagster can be thought of as the glue code, orchestrating data between the different phases of its journey.
Dagster itself fits our definition of a Modern Data Stack tool. It's open source, lightweight to run, very scalable, and can be used purely as a cloud-hosted SaaS solution if you do not wish to manage your own instance.
Dagster is not the only tool in the data orchestration space.
Apache Airflow is the most commonly deployed tool. Dagster has many similarities to Airflow, including Python-based DAG definitions, but Airflow is much older, which sometimes shows in its architecture.
The key differentiators of Dagster, as we see them, are:
- Dagster introduces the concept of a data asset as a first class citizen, whereas Airflow is more of a job runner;
- Dagster makes it easier to develop and test your pipelines before moving them into production. This reduces the risk of production deploys.
Prefect also has market share in this area. It is another Python-based platform, which modernises the approach in light of lessons learned from Airflow.