fb

ETL vs. ELT - What’s the big deal?

In previous 2 posts, we talked about the processing of loading data onto data warehouse and the warehouse itself. This post will discuss 2 common technologies called ETL and ELT, what their differences are and why we’re advocating for the latter (ELT).

Do note that some of this post will make more sense after you read Transforming Data in Data Warehouse (chapter 3).

The ETL process

In any organization’s analytics stack, the most intensive step usually lies in data preparation: combining, cleaning, and creating data sets that are ready for executive consumption and decision making.

This function is commonly called ETL (Extract, Transform, and Load), identifying the three distinct stages involved.

In this process, an ETL tool extracts the data from different RDBMS source systems then transforms the data like applying calculations, concatenations, etc. and then load the data into the Data Warehouse system.

What happens in this approach:

  1. You extract data from sources (write queries on your SQL databases, or send data extract requests to an application via its API).
  2. Extracted data will be transformed in the ETL server’s memory & procesing.
  3. Transformed data is then loaded in the final data storage/warehouse.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/5258b51b-5e3a-4e7a-a603-e0e11d760305/Untitled.png

The ELT process

ELT is a different method of looking at the tool approach to data movement. Instead of transforming the data before it’s written, ELT lets the data warehouse to do the transformation. Your data will loaded into the data warehouse, and then transformed in place.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/80960644-5e70-41ca-a466-84cc3ef74c28/Untitled.png

What happens in this approach:

  1. You extract data from sources.
  2. Instead of transforming in-memory right away, you load extracted data straight into the destination data storage/warehouse.
  3. Finally, you perform any necessary transformation right in your data warehouse

The two above looks similar, what’s the big deal?

At first glance, the difference of these two approaches is just in the order of execution. However this makes a big difference.

  • In ETL, you use a normal computer server to transform the data, and load them to data warehouse.
  • In ELT, you can leverage the power of modern data warehouses to transform data.

With ETL, as long as your

Historically, building a datawarehouse was very expensive. The cost for a data warehouse (both software and hardware) 20-30 years ago can go up to millions of dollars, just for.

Traditionally, data warehouses licenses are expensive, perhaps to the order of hundred thousands dollar per CPU core.

So to minmimize cost people want to load only clean, transformed data into data warehouse for analytics purpose.

Terrabytes of data can be scanned and aggregated in mere seconds in analytics database, because these these technologies can scale out to multiple distributed machines), while the same cannot be performed in a traditional ETL setup.

Offloading the transformation workload to the data warehouse also removes the headache of managing resources of your ETL server.

As stated in Our Biases , we favor ELT over ETL, and we believe it is the way a modern analytics stack should be set up.