In this lesson we will:
- Explain some of the core concepts and terminology within Dagster.
Dagster Concepts
For SQL engineers or software developers moving into data engineering, workflow orchestration tools such as Airflow and Dagster are often a new class of software that they have not encountered before.
For this reason, it is worth introducing the core concepts and terminology before moving forward.
Assets
Dagster introduces the concept of a data asset. An asset can be thought of as a piece of data persisted in storage, such as a file or a database table. This could represent source data, data that is in the middle of a transformation process, or data that is ready to be served to some consumer.
A Dagster graph will therefore create a series of assets as it runs. Some of these intermediate assets may be discarded when the pipeline completes successfully.
Metadata
When producing an asset, it may be useful to capture metadata such as the number of rows or average price of an order. This can be used to validate that the asset was created correctly, and can also be exposed to users through the Dagster GUI.
Operations (Ops)
An operation (op) is one discrete step that you want to carry out against your data assets. Example operations might be to download a file, to transform it, to anonymise it, or to copy it to its eventual destination.
Pipeline
Operations can be chained together into a pipeline of steps with dependencies between each step.
Graph or DAG
In reality, operations are combined in more complex ways than simple serial pipelines. A better model is the graph, specifically a directed acyclic graph (DAG). A graph can have parallel phases, and a step can depend on multiple upstream steps, giving us a complex network of dependent operations.
Orchestration
At a high level, data orchestration is what Dagster does: scheduling, running, and monitoring these graphs of operations, as described in the previous lesson.
Jobs
To run our work, we define a job. A job has an associated DAG, along with configuration such as the schedule on which it should run.
A job can be defined from a graph, a collection of software-defined assets, or a collection of ops.
Schedules
Schedules trigger automatic runs of our Dagster jobs on a periodic basis, typically defined with a cron expression.
Sensors
Sensors trigger runs in response to events rather than on a timer, telling us when source data has changed. For example, a sensor might watch for a file being updated or for an asset being materialised by another job.