There seems to be a knowledge gap around Data Orchestration tools such as Airflow and Dagster in the Data Engineering community.
Though most Data Engineers have a surface-level understanding of what they do, many don't quite grasp their value, and either leave them out of their stack entirely or use them as dumb script runners.
These tools are powerful levers for improving end-to-end Data Engineering workflows, though. For instance:
Scheduling jobs - You can schedule jobs without Cron, triggering them on time schedules or on events such as new data arriving. This enables a move from rigid batch windows to more frequent, dynamic pipeline runs.
Clean and maintainable code - You can break your code out of proprietary tools and define it in Python, complete with classes, modules, and unit tests. That code can be checked into source control, versioned, and run through CI/CD pipelines.
Separate environments from logic - You can separate your environment configuration from your pipeline logic, making it easier to run the same code in dev/test/prod.
Operations - You can monitor pipelines, set up alerting, and re-run individual parts of a pipeline via the GUI.
Distribute workloads - You can run the tasks in a compute cluster such as Kubernetes for parallelism and scale up.
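The core idea underlying all of these features is that a pipeline is a DAG of tasks, and the orchestrator runs tasks in dependency order. A minimal sketch of that idea, using only Python's standard-library graphlib (the task names extract/transform/load are hypothetical placeholders, not real Airflow or Dagster APIs):

```python
from graphlib import TopologicalSorter

# Hypothetical tasks standing in for real pipeline steps.
def extract():
    return [1, 2, 3]               # pretend this pulls rows from a source system

def transform(rows):
    return [r * 10 for r in rows]  # pretend this cleans/enriches the data

def load(rows):
    print(f"loaded {len(rows)} rows")
    return rows

# Declare dependencies: transform depends on extract, load depends on transform.
dag = {"transform": {"extract"}, "load": {"transform"}}

# Run tasks in topological (dependency) order, passing results downstream.
results = {}
for task in TopologicalSorter(dag).static_order():
    if task == "extract":
        results[task] = extract()
    elif task == "transform":
        results[task] = transform(results["extract"])
    elif task == "load":
        results[task] = load(results["transform"])
```

Real orchestrators layer scheduling, retries, monitoring, and distributed execution on top of exactly this dependency graph, which is why they are far more than script runners.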
I know this isn't news to some people, but my impression was that Airflow, and therefore the entire orchestration space, was a dumb script runner with some dependency management.
dbt and in-database transformations get all of the attention in the Modern Data Stack conversation, but there is a ton of glue code and Python-based analytics that can be cleanly handled by an orchestration platform.
I believe that a business with anything beyond rudimentary requirements should be looking to stand up one of these platforms to manage their data flows.