What Is Data Engineering?

Businesses today collect lots of data on how they are operating, performing and interacting with their customers.

This data can be captured from many sources, including websites and mobile applications, internal line-of-business and enterprise applications, and external suppliers and partners.

Businesses also have many ways in which they would like to consume and use this data. This requires bringing all of their source data together, manipulating and cleaning it, then analysing and mining it for insights. Finally, the results are exposed through routes such as reports, dashboards, or other endpoints.

So at a high level, businesses face a challenge: at the source, a wide variety of structured and unstructured data is being created in high volumes and at high speed, whilst on the consuming end, people are looking to consume data and gain insights in ever more complex and creative ways.

Data Engineers are the people with the unenviable task of bringing these two sides together, putting the pipelines in place for continually extracting the data from source systems, manipulating it into the desired formats, layering in analytics, and serving up the data and insights through the correct channels.

Data Engineering is increasingly a critical role and practice for businesses, but one which even today is poorly understood and defined. Historically, this type of work hasn't had a name, and many of the best practices that we would expect to see were lacking. Data Engineering is giving the industry a job role and a set of best practices for the first time.

The Data Engineering Process

If we break down the problem, Data Engineers need to do three things to move their data from source to destination:

  • Extracting data from the sources, either through APIs, by querying the database directly, or setting up a solution for extracting from the source periodically;
  • Transforming data into joined up, usable formats and structures, and perhaps layering in analytics. This requires expertise in data modelling to structure the data optimally;
  • Loading data into locations where it can subsequently be used. Often this will be a data warehouse or data lake which will service reports, dashboards or ad-hoc analysis by Data Analysts, Data Scientists and business users.
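The three steps above can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the `SOURCE_ORDERS` records, the `orders` table and the function names are all hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import sqlite3

# Hypothetical source records, standing in for an API response or database query.
SOURCE_ORDERS = [
    {"order_id": 1, "customer": " Alice ", "amount": "10.50"},
    {"order_id": 2, "customer": "bob", "amount": "4.25"},
    {"order_id": 3, "customer": "Alice", "amount": "7.00"},
]


def extract():
    """Extract: return raw records from the (hypothetical) source."""
    return list(SOURCE_ORDERS)


def transform(rows):
    """Transform: tidy customer names and convert amounts to numbers."""
    return [
        (row["order_id"], row["customer"].strip().title(), float(row["amount"]))
        for row in rows
    ]


def load(conn, rows):
    """Load: write the cleaned rows into a warehouse-style table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    # INSERT OR REPLACE keeps the load safe to re-run for the same order_id.
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


def run_pipeline(conn):
    """Run extract, transform and load in order."""
    load(conn, transform(extract()))


conn = sqlite3.connect(":memory:")  # in-memory database standing in for a warehouse
run_pipeline(conn)

# A downstream consumer, such as a report, can now query the cleaned data.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(totals)
```

Keeping each step in its own function means the transformation logic can be changed or tested without touching the code that talks to the source or the destination.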

In some instances, the Data Engineer's responsibility will also extend into areas such as serving up the correct reports and dashboards, though often Data Analysts and Data Scientists will work on the last mile with the Data Engineers in a supporting role. That said, this distinction is blurring, with Data Engineers and Data Analysts knowing more about each other's roles.

Of course, this data work isn't just a one-time activity. Data Engineers need to put in place pipelines which continually process data as it is created in the sources, and to keep these pipelines running reliably and accurately in production, with fast delivery of data. Hopefully it starts to become clear that this isn't an easy task!
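One common way to process data continually is to remember how far the previous run got and only pick up what is new on each run. The sketch below assumes, purely for illustration, that every source row carries an ever-increasing `id`; the `extract_new_rows` function and the sample events are hypothetical rather than taken from any particular tool.

```python
def extract_new_rows(source_rows, last_seen_id):
    """Return rows newer than the watermark, plus the updated watermark."""
    new_rows = [row for row in source_rows if row["id"] > last_seen_id]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((row["id"] for row in new_rows), default=last_seen_id)
    return new_rows, new_watermark


# Hypothetical source that grows over time.
source = [{"id": 1, "event": "signup"}, {"id": 2, "event": "purchase"}]
watermark = 0

# First run: everything is new.
first_batch, watermark = extract_new_rows(source, watermark)

# A new event arrives in the source before the next scheduled run.
source.append({"id": 3, "event": "refund"})
second_batch, watermark = extract_new_rows(source, watermark)

# Third run: nothing has changed, so there is nothing to process.
third_batch, watermark = extract_new_rows(source, watermark)

print(len(first_batch), len(second_batch), len(third_batch), watermark)
```

In a real pipeline the watermark would be stored somewhere durable between runs, and a scheduler would trigger each run.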

A Foundational Activity

Many businesses looking to improve their data capabilities will begin by hiring Data Scientists and Data Analysts. However, they will then find that these people are not as productive as they could be because they need to spend significant time on data extraction and preparation.

These newly hired Data Professionals often find, for instance, that they need to manually request raw data extracts from source systems, that the data they receive is messy, out of date or incomplete, or that it is delivered in sub-optimal formats such as Excel spreadsheets. This is not the best use of their time and skills!

With a Data Engineering capability, these plumbing activities are handled on behalf of Data Analysts and Data Scientists, who have more time to apply their niche skills on actual analysis and modelling. Not only this, they will also continually receive up-to-date data through robust data delivery pipelines.

With this in mind, businesses should look at Data Engineering as a foundational capability which has to be put into place before Data Analysts and Data Scientists can be effective, let alone wider enablement of the business.

Modern Data Engineering

ETL (extract, transform, load) has been around as a process and practice within businesses for many decades. However, ad-hoc, unstable, slow data extracts are no longer fit for purpose. As data grows in strategic importance, Data Engineering practices are emerging as the solution to this.

Data Engineering today looks very different to the ETL engineering of even a few years ago.

Where before it was all built on proprietary heavyweight tools and on-premises solutions, nowadays there are more SaaS services, cloud native tools and open source solutions in play. For instance, we have tools such as Fivetran and Stitch which are SaaS tools commoditising the ETL process, tools such as DBT which are modernising how we do transformations, platforms such as Kafka for moving from batch to streaming data exchange, and modern cloud data warehouses such as Snowflake and Redshift as data stores. It is one of the most rapidly evolving areas in enterprise IT.

Data Engineering is also incorporating some of the best practices that are found in the software engineering world. These include modularisation, testability, scalability, source code control and code reuse. Data Engineering is bringing a significant professionalisation to the field.
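As a small illustration of modularisation and testability, a transformation can be written as a pure function with its own test. The `normalise_email` function and its rules are hypothetical, chosen only to show the shape of the practice.

```python
def normalise_email(raw):
    """Return a trimmed, lower-cased email, or None if the value is unusable."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    # A deliberately simple validity rule for this sketch.
    return cleaned if "@" in cleaned else None


def test_normalise_email():
    """Check the cleaning rules on a few representative inputs."""
    assert normalise_email("  Jane.Doe@Example.COM ") == "jane.doe@example.com"
    assert normalise_email("not-an-email") is None
    assert normalise_email(None) is None


test_normalise_email()
```

Because the function has no dependency on a database or an API, it can be reused across pipelines and checked automatically whenever the code changes.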

Many companies are investing in their data engineering capabilities today. They are, however, coming up against a huge shortage of talent and experience in this field, as more and more companies realise the value of this capability whilst the skills are still relatively niche. This means that people who do acquire modern Data Engineering skills are in high demand in the market.

About Timeflow Academy

Timeflow Academy aims to support engineers who would like to learn about Data Engineering and cross train from other fields.

We do this primarily by providing training on leading tools and technologies including DBT, Clickhouse, Snowflake, Kafka, Spark and Airflow. However, just as important as the specific tools are the philosophy, practices and approach of Data Engineering as a discipline and an emerging job role.

Read more about Timeflow Academy here, or visit our course list to get started.

© 2022 Timeflow Academy.