In this lesson we will:

  • Learn about DBTs seed data feature;
  • Explain how seed data should appropriately be used.

What Is DBT Seed Data?

Most of our work with DBT will involve taking some source data ingested from some application, and transforming it into our views and tables in our desired destination formats and structured.

As part of this process, we may require external static data or lookup tables to enable the transformation. For instance:

  • To map short country codes to country names (e.g. US > United States);
  • To map a list of product codes to longer descriptions (e.g. PZ04 > Pepperoni Pizza)
  • Something specific to the business domain such as a hard coded list of staff or office locations.

DBTs Seed Data feature allows us to store this data directly within the DBT project formatted as a CSV file where it can be properly source controlled, and then bought into the Data Warehouse as part of the DBT run.

The advantage of managing seed data in this way is that it becomes scripted, and can be bought into source control as part of the DBT project. When we have this seed static data being loaded into the warehouse in a reliable way, we can then build our transformations, safe in the knowledge that the static data and lookup tables we need will be present prior the transformation running.

Correct Use

This seed data feature is not designed as a production quality data loading tool to bring large data extracts into your warehouse. Rather, it is there to support ad-hoc loads of small static data tables required to support and enhance the transformations which you write.

Using Seed Data

Seed data is stored within the DBT project as a CSV file inside the data folder of your project. We can create an example like so:

cd data
touch country_codes.csv

We can then edit the file:

nano country_codes.csv

And paste the following CSV formatted data:

Code, Country_Name
US, United States
UK, United Kingdom
FRA, France
AUS, Austria
CH, China 

After saving the file, we can now import the seed data file into the database with the following command:

dbt seed

If we log back into Postgres, we will see that we have a table created which is named was based on the CSV file.

postgres -TBC
use pizzastore_analytics
select * from country_codes 

Outputs:

TBC

Note that the column types are automically inferred from the data when seeding data. If you wish to specify a different type for the columns, this can be overridden in DBT YAML files.

Next Lesson:
09

Testing With DBT

In this lesson we will use the testing features of DBT to validate data transformations and pipelines.

0h 15m



Continuous Delivery For Data Engineers

This site has been developed by the team behind Timeflow, an Open Source CI/CD platform designed for Data Engineers who use dbt as part of the Modern Data Stack. Our platform helps Data Engineers improve the quality, reliability and speed of their data transformation pipelines.

Join our mailing list for our latest insights on Data Engineering:

Timeflow Academy is the leading online, hands-on platform for learning about Data Engineering using the Modern Data Stack. Bought to you by Timeflow CI

© 2023 Timeflow Academy. All rights reserved