Orchestrating Data Platforms With Dagster

Core Concepts Of Dagster

Lesson #2

In this lesson we will:

  • Explain some of the core concepts and terminology within Dagster.

Dagster Concepts

For SQL engineers and software developers moving into Data Engineering, workflow orchestration tools such as Airflow and Dagster are often a new class of software that they haven't encountered before.

For this reason, it is worth introducing the core concepts and terminology before moving forward.

Assets

Dagster introduces the concept of a data asset. An asset can be thought of as a piece of data persisted to storage, such as a file or a table. This could represent source data, data which is in the middle of a transformation process, or data that is ready to be served to some consumer.

A Dagster pipeline therefore creates a series of assets as it moves through the graph. Some of these assets are intermediate and may be discarded when the pipeline completes successfully.
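As a minimal sketch, two dependent assets might be declared as follows (the asset names and the inline data are hypothetical):

```python
from dagster import asset

@asset
def raw_orders():
    # In a real pipeline this might be read from an API or a file.
    return [
        {"order_id": 1, "price": 10.0},
        {"order_id": 2, "price": 25.0},
    ]

@asset
def cleaned_orders(raw_orders):
    # Dagster infers the dependency on raw_orders from the parameter name.
    return [order for order in raw_orders if order["price"] > 0]
```

Note that the dependency between the two assets is declared simply by naming the upstream asset as a function parameter.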

Metadata

When producing an asset, it may be useful to capture metadata such as the number of rows or average price of an order. This can be used to validate that the asset was created correctly, and can also be exposed to users through the Dagster GUI.
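One way to attach such metadata is to return an Output object from the asset. The sketch below uses hypothetical data and metadata keys:

```python
from dagster import Output, asset

@asset
def orders_summary():
    rows = [{"price": 10.0}, {"price": 25.0}]  # hypothetical data
    avg_price = sum(row["price"] for row in rows) / len(rows)
    # Metadata attached to the Output is recorded against this
    # materialisation and surfaced in the Dagster UI.
    return Output(rows, metadata={"num_rows": len(rows), "avg_price": avg_price})
```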

Operations (Ops)

An operation (op) is one discrete step which you want to carry out against your data. Example operations might be to download a file, to transform it, to anonymise it, or to copy it to its eventual destination.
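A minimal sketch of a single op, assuming a hypothetical download location:

```python
from dagster import op

@op
def download_file(context):
    # context.log writes to Dagster's structured event log.
    context.log.info("Downloading source file")
    return "/tmp/raw_orders.csv"  # hypothetical download location
```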

Pipeline

Operations can be chained together into a pipeline of steps with dependencies between each step.
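For example, three ops might be chained serially inside a job (the op names and payloads are hypothetical):

```python
from dagster import job, op

@op
def extract():
    return "raw data"  # hypothetical payload

@op
def transform(raw):
    return raw.upper()

@op
def load(transformed):
    print(transformed)

@job
def etl_pipeline():
    # Each op's output feeds the next, forming a simple serial pipeline.
    load(transform(extract()))
```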

Graph or DAG

In reality, operations are combined in more complex ways than simple serial pipelines. A better model is the graph: a graph can have parallel branches, and a step can depend on multiple upstream steps, giving us a complex network of dependent operations. Because these dependencies must not form cycles, such a graph is known as a Directed Acyclic Graph (DAG).
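A sketch of a graph with two parallel branches that join, using hypothetical op names:

```python
from dagster import graph, op

@op
def fetch_orders():
    return [{"order_id": 1, "customer_id": 7}]  # hypothetical data

@op
def fetch_customers():
    return [{"customer_id": 7, "name": "Alice"}]  # hypothetical data

@op
def join_data(orders, customers):
    return {"orders": orders, "customers": customers}

@graph
def enrichment_graph():
    # fetch_orders and fetch_customers have no dependency on each other,
    # so Dagster is free to run them in parallel; join_data waits for both.
    join_data(fetch_orders(), fetch_customers())
```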

Orchestration

Data Orchestration is what Dagster does at a high level, as described in the previous lesson: it coordinates when each operation runs and in what order, based on the dependencies in the graph.

Jobs

A job is the main unit of execution in Dagster. A job has an associated DAG of work, as well as things like a schedule on which it should run.

A job can be defined using a graph, a collection of software-defined assets, or a collection of ops.
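As a sketch, a job might be built from a graph or from a selection of assets; the names below (hello_graph, orders_job, and the asset keys) are hypothetical:

```python
from dagster import define_asset_job, graph, op

@op
def say_hello():
    print("hello")

@graph
def hello_graph():
    say_hello()

# A job built from a graph.
hello_job = hello_graph.to_job()

# A job built from a selection of assets (the asset keys are hypothetical).
orders_job = define_asset_job(
    name="orders_job",
    selection=["raw_orders", "cleaned_orders"],
)
```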

Schedules

Schedules trigger automatic runs of our Dagster jobs on a periodic basis, typically expressed as a cron string.
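A minimal sketch that runs a hypothetical refresh_job every day at midnight:

```python
from dagster import ScheduleDefinition, job, op

@op
def refresh():
    pass  # hypothetical refresh logic

@job
def refresh_job():
    refresh()

# A standard five-field cron expression: every day at midnight.
daily_refresh = ScheduleDefinition(job=refresh_job, cron_schedule="0 0 * * *")
```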

Sensors

Sensors watch for changes in external state and tell us when to trigger a run. For example, a sensor might fire when a file has been updated or when an asset has been materialised by another job.
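A sketch of a sensor that launches a job when a (hypothetical) landing file appears:

```python
import os

from dagster import RunRequest, job, op, sensor

@op
def process_orders():
    pass  # hypothetical processing logic

@job
def process_job():
    process_orders()

@sensor(job=process_job)
def new_file_sensor():
    path = "/data/landing/orders.csv"  # hypothetical landing path
    if os.path.exists(path):
        # run_key de-duplicates runs triggered by the same event.
        yield RunRequest(run_key=path)
```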

Next Lesson: Setting Up Dagster For Local Development

In the next lesson we will set up Dagster for local development and testing.


