Transforming Data in the ELT paradigm

What is data transformation?

Data transformation is the process of changing data from one structure into another (hence ‘transform’).

Common use cases of data transformations include:

  • Data cleaning: You correct, remove or update inaccurate records in a dataset (see the sketch after this list).
  • Data aggregation: You summarize data into a more compact form. For example, calculating transaction counts by region and category, and storing the results in a new table.
  • Pre-computation: You calculate new numbers according to a business formula during a transformation step, turning raw data into a business measure (e.g. you calculate a ratio or a trend).
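
As a small illustration of the cleaning case, here is a minimal SQL sketch. The raw_users table and its columns are hypothetical and only serve to show the idea:

-- Data cleaning sketch: drop obvious test records and normalize fields
-- (raw_users is a hypothetical source table)
CREATE TABLE users_cleaned AS
SELECT
  id,
  lower(email) AS email,
  upper(trim(country_code)) AS country_code
FROM raw_users
WHERE email NOT LIKE '%@test.com';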

Data transform

Example of transformation that happens inside a data warehouse using SQL

Why do I need to transform data?

Raw data that is piped into your data warehouse is usually in a format designed for storing transactional information. To analyze this data effectively, a transformation step is needed to make it easier to work with. Some people call this ‘data modeling’. We will talk about modeling in the next chapter, but note here that data transformation includes more than just modeling.

Applying good data transformations will yield the following benefits:

  • Reusability: Think of each data transform as a data component that expresses some business logic; this component can be reused multiple times in different reports and analyses.
  • Your reports and analyses will be more consistent: Because of the above reusability property, instead of rewriting the same SQL logic in multiple reports, the logic is written in just one transform and reused across reports. This helps you avoid the scenario where two different reports produce two different numbers for the same metric.
  • Improve overall runtime performance: If data is transformed and aggregated ahead of time, the heavy computation happens only once, when the transform runs. Reports then query the pre-computed results, which reduces report processing time significantly.
  • Cost effectiveness: Less repeated computation leads to lower processing and server costs overall.

Implementing data transformation in the ELT paradigm

In the past, data transformation was often done by an ETL tool, before the loading process into the data warehouse. This meant significant data engineering involvement, as it was the engineers who created and maintained such transformation pipelines. The pipelines were also often implemented in some programming or scripting language.

It is no accident that we introduced the concept of ELT instead, which has gained traction alongside the growth of more powerful data warehouses. In this paradigm, data is loaded into a data warehouse before it is transformed. This allows us to do two things.

First, it allows us to write transformations using SQL alone. A transformation thus becomes the creation of a new table within the same data warehouse, storing the results of that transform step.

Second, it allows analysts to write transformations. This removes the dependency on data engineering, and frees us from building and maintaining pipelines in external tools. (We’ve already covered this in 2.3 ETL vs. ELT - What’s the big deal?, so we recommend that you read that if you haven’t done so already.)

Let’s talk about what this actually looks like.

Imagine that we’re running a hotel booking website, and want to create a summarization of daily bookings according to a few dimensions. In this case, we want to look at dimensions like country, as well as the platform on which the booking was made.

Our raw bookings data would look like this after being loaded into the data warehouse:

Table bookings {
  id integer [pk]
  email varchar
  country_code varchar
  platform varchar
  user_id integer
  listing_id integer
  created_at timestamp
}

Table countries {
  country_code varchar [pk]
  country_name varchar
}

(syntax written using DBML)

We want to transform them into a bookings_daily table like so.

Transform and aggregate

So to implement this, the code for the transform job will be:

-- Transform: Summarize bookings by country and platform
BEGIN;
DROP TABLE IF EXISTS bookings_daily;
CREATE TABLE bookings_daily (
  date_d date,
  country_name varchar,
  platform varchar,
  total integer
);
INSERT INTO bookings_daily (
  date_d, country_name, platform, total
)
SELECT
  B.created_at::date AS date_d,
  C.country_name,
  B.platform,
  count(*) AS total
FROM bookings B
LEFT JOIN countries C ON B.country_code = C.country_code
GROUP BY 1, 2, 3;
COMMIT;

The above code:

  • Creates a table named bookings_daily (or recreates it if the table already exists)
  • Runs an SQL transform query to calculate the aggregation and load the results into the newly created table. In the process, the code also turns country code into proper country name by joining with a countries table.
  • All of this is done inside a database transaction, so that if things fail half-way, the prior state is restored.

To deploy the above SQL code to production, we set up a daily cron job that runs the SQL file. This is the most basic method possible, and the contents of our cron job will look like so:

$ psql -f transforms/bookings_daily.sql
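
For example, a crontab entry that runs this every day at 2 AM might look like the line below; the analytics database name and the file path are placeholders for illustration:

# m h dom mon dow  command
0 2 * * * psql -d analytics -f transforms/bookings_daily.sql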

In the above example, the main transform logic is only within the SELECT statement at the end of the code block. The rest is considered metadata and operational boilerplate.

Besides using an SQL table to store the transform results, we may also opt to create a database view (which means we store the definition only), or we can create a materialized view for it.
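
For instance, using the same transform query, the view and materialized view variants might look like the sketch below (Postgres syntax; the _v and _mv names are just illustrative):

-- View variant: only the definition is stored; the query runs each time the view is read
CREATE VIEW bookings_daily_v AS
SELECT
  B.created_at::date AS date_d,
  C.country_name,
  B.platform,
  count(*) AS total
FROM bookings B
LEFT JOIN countries C ON B.country_code = C.country_code
GROUP BY 1, 2, 3;

-- Materialized view variant: results are persisted and refreshed on a schedule
CREATE MATERIALIZED VIEW bookings_daily_mv AS
SELECT * FROM bookings_daily_v;

REFRESH MATERIALIZED VIEW bookings_daily_mv;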

Using data transform tools

In practice, dedicated transformation tools (like Holistics, dbt, Dataform and so on) handle the SQL boilerplate and let you focus on just the core transformation logic.

For example, the below screenshots show how this is done using Holistics:

  • The user focuses on writing SQL to transform data, and the software handles the creation of the model.
  • Holistics allows you to choose to save it just as a “view” or as a “materialized view” (persisted to the data warehouse as a new table).

Holistics data transform

Storage settings

The transformation workflow (or the DAG)

The above section talks about a single transform step. But when you have multiple transforms that depend on each other, you will run into a problem of “dependency management”.

For example, let’s say that you have two transform jobs: one to calculate sales revenue, and the other to calculate sales commissions based on revenue. You will want the commissions job to be run only after the revenue calculation is done.
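
As a rough SQL sketch of such a dependency (the sales table, its columns, and the 5% commission rate are all assumptions for illustration), the second job reads from the output of the first:

-- Job 1: calculate daily sales revenue
DROP TABLE IF EXISTS sales_revenue;
CREATE TABLE sales_revenue AS
SELECT
  created_at::date AS date_d,
  salesperson_id,
  sum(amount) AS revenue
FROM sales
GROUP BY 1, 2;

-- Job 2: calculate commissions, which depends on sales_revenue above
DROP TABLE IF EXISTS sales_commissions;
CREATE TABLE sales_commissions AS
SELECT
  date_d,
  salesperson_id,
  revenue * 0.05 AS commission  -- assumed 5% commission rate
FROM sales_revenue;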

This concept is called a DAG (directed acyclic graph) workflow. It is best explained by the diagram below.

Directed acyclic graph

In the above diagram:

  • Each node inside the data warehouse represents a table. The left column (A, B, C, D) contains tables loaded from source systems into the data warehouse.
  • The arrows represent dependencies: to create table E, we need data from tables A, B and C.
  • Tables E to I are transformed tables, created by running the corresponding transformation jobs.
  • You can see that job E should only run after tables A, B and C are ready, while job I should only run after both D and F are ready. This is the dependency property of a DAG workflow in action.

In practice, most data transformation tools will have support for DAG workflows. This is especially true for classical ETL tools, in the older paradigm. Regardless of which paradigm you’re in, your job will be to focus on managing each transform’s logic and their dependencies.

To continue the earlier example using Holistics, once you define a transform job using SQL, Holistics will automatically read the transformation query and then calculate a dependency workflow for it.

Dependency workflow

A note on traditional Data Transform using ETL

The examples we have discussed so far are done in the context of the ELT model, which means the transformation happens directly inside the data warehouse.

Let’s now talk briefly about the drawbacks of the more classical approach. As we’ve mentioned before, data transformation used to take place in a programming or scripting language, before the data was loaded into the data warehouse. Below is an example of a transform step written in the Ruby programming language.

# Extract: load users from the source CRM system (returns an array of hashes)
users = load_users_from_crm()

# Transform: select only active users
filtered_users = users.select { |u| u['active'] == true }

# Load: write the result into a data warehouse table
write_to_table(filtered_users, 'reporting.active_users')

The main drawback of this approach is that the majority of the load is now on the single computer that runs the script (which has to process millions of data records).

This worked well when data warehouses were slow and expensive. It also worked well at a time when data volumes were comparatively low. Given these restrictions, data professionals would look for ways to offload all processing outside of the data warehouse, so that they may only store cleaned, modeled data in the warehouse to cut down on costs.

However in recent years, data warehouse costs have gone down significantly, while performance has drastically improved. Today, running transformations in the data warehouse is typically more cost efficient than executing the transformation in code running on a normal machine. Thus we see the trend of moving all processing into the data warehouse itself (ELT).

You may read more about this in the 2.1 Consolidating data from source systems section of our book.

Advanced Topic: Incremental transform to optimize for performance

In our section on “data loading” earlier, we spoke about Incremental Load, where only the differences in data are loaded from source to destination. A similar concept exists with data transformation.

Let’s revisit the earlier example of bookings → bookings_daily. This time, let’s look at how to run this transformation incrementally.

Incremental transform

In the above diagram, you can see that when the transform job runs, only data for 2020-01-04 will be processed, and only two new records will be appended to the bookings_daily table.

How much does this save us? Quite a bit, as it turns out.

Imagine that your bookings have 100M records and are growing at a pace of 10,000 records a day:

  • With incremental transformation: you only process 10,000 records a day
  • Without incremental transformation: you process 100M (and growing) records a day.

Incremental Transform in Practice

In practice, an incremental transform in SQL should look something like this:

destination: bookings_daily
incremental:
  enabled: true
  column: date_d
---

SELECT
  B.created_at::date AS date_d,
  C.country_name,
  B.platform,
  count(*) AS total
FROM bookings B
LEFT JOIN countries C ON B.country_code = C.country_code
WHERE [[ B.created_at::date > {{max_value}} ]] -- this is the condition added for incremental runs
GROUP BY 1, 2, 3

The [[ B.created_at::date > {{max_value}} ]] condition is added so that the tool pulls the latest value of the incremental column from the destination table and substitutes it into the SQL query. With this, only newer data is materialized into the destination table.
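
To make this concrete, here is a sketch of what the query might look like after the tool substitutes {{max_value}}, assuming the latest date_d currently in bookings_daily is 2020-01-03 (the date is purely illustrative):

INSERT INTO bookings_daily (date_d, country_name, platform, total)
SELECT
  B.created_at::date AS date_d,
  C.country_name,
  B.platform,
  count(*) AS total
FROM bookings B
LEFT JOIN countries C ON B.country_code = C.country_code
WHERE B.created_at::date > '2020-01-03'  -- substituted from {{max_value}}
GROUP BY 1, 2, 3;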

When can you not run incremental transform?

If you look at the above example, it is clear that sometimes you cannot run incremental transform:

  • When your old transformed data keeps changing and needs to be reloaded (for example, when past bookings can be cancelled or modified retroactively).
  • An incremental transform also usually only works if the load/transform step before it is itself incremental (i.e. the bookings table itself can be loaded incrementally), though this is not necessarily true all the time.