ETL vs. ELT - What’s the big deal?
In the previous sections we have mentioned two terms repeatedly: ETL, and ELT. In this section, we will dive into details of these two processes, examine their histories, and explain why it is important to understand the implications of adopting one versus the other.
Do note that some of this will make more sense after you read Transforming Data in the ELT paradigm.
The ETL process
In any organization’s analytics workflow, the most intensive part is usually data preparation: that is, combining, cleaning, and creating data sets that are ready for business consumption and decision making.
This function is commonly known as ETL (Extract, Transform, and Load), which identifies the three distinct stages involved.
In this process, an ETL tool extracts the data from different data source systems, transforms the data by applying calculations, concatenations, and the like, and finally loads the data into the data warehouse.
What happens in this approach:
- You extract data from sources (write queries on your SQL databases, or send data extract requests to an application via its API).
- Extracted data will be transformed in the ETL tool’s memory.
- Transformed data is then loaded in the final data storage/warehouse.
The key things to note here are that raw data is transformed outside of the data warehouse, usually with the help of a dedicated “staging server”, and that only transformed data is loaded into the warehouse.
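The three stages can be sketched in a few lines of Python. This is a toy illustration, not a real pipeline: the source rows, table name, and columns are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Extract: rows pulled from a source system (hypothetical data standing in
# for a SQL query result or an API response).
raw_orders = [
    {"id": 1, "amount": "19.99", "country": "us"},
    {"id": 2, "amount": "5.00", "country": "SG"},
]

# Transform: done in the ETL tool's memory, BEFORE anything reaches
# the warehouse -- cast amounts to numbers, normalize country codes.
transformed = [
    {"id": r["id"], "amount": float(r["amount"]), "country": r["country"].upper()}
    for r in raw_orders
]

# Load: only the transformed rows reach the warehouse
# (an in-memory SQLite database stands in for the warehouse here).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (id INTEGER, amount REAL, country TEXT)")
warehouse.executemany(
    "INSERT INTO orders VALUES (:id, :amount, :country)", transformed
)
print(warehouse.execute("SELECT country, amount FROM orders ORDER BY id").fetchall())
# → [('US', 19.99), ('SG', 5.0)]
```

Note that the warehouse never sees the raw strings: if you later need a different transformation, you must re-extract from the source.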
The ELT process
ELT is a different way of looking at this problem. Instead of transforming the data before it is loaded into the database, ELT does the transformation within the data warehouse. Your data will be loaded into the data warehouse first, and then transformed in place.
What happens in this approach:
- You extract data from sources.
- Instead of transforming in-memory, using a pipelining tool, you load the raw, extracted data straight into the destination data storage/warehouse.
- Finally, you perform any necessary transformations within your data warehouse.
The key things to note here are that raw data is transformed inside the data warehouse without the need for a staging server, and that your data warehouse now contains both raw data and transformed data.
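The same toy data illustrates the ELT ordering. Again, this is a hedged sketch with hypothetical tables and an in-memory SQLite database standing in for the warehouse; the point is that the raw rows land in the warehouse first, and SQL running inside the warehouse does the transformation.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Load: raw, untransformed rows go straight into the warehouse.
warehouse.execute("CREATE TABLE raw_orders (id INTEGER, amount TEXT, country TEXT)")
warehouse.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "us"), (2, "5.00", "SG")],
)

# Transform: done inside the warehouse with SQL; the raw table and the
# transformed table now sit side by side in the same warehouse.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT id, CAST(amount AS REAL) AS amount, UPPER(country) AS country
    FROM raw_orders
""")
print(warehouse.execute("SELECT country, amount FROM orders ORDER BY id").fetchall())
# → [('US', 19.99), ('SG', 5.0)]
```

Because `raw_orders` is still there, a new transformation is just another SQL statement against data you already have, with no re-extraction from the source systems.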
The shift from ETL to ELT
Historically, building a data warehouse was a very expensive undertaking, both on the hardware side and on the software side. The server costs, implementation costs, and software licenses for a data warehousing project 20 to 30 years ago could easily run into the millions of dollars, and implementation could take months.
- Since data warehouses were so expensive, to save on cost, people would only want to load clean, properly transformed and aggregated data into the data warehouse.
- Practitioners were still following waterfall development models back then, so it was acceptable to take the time to plan out and perform proper transformations.
In this context, the ETL model made perfect sense: raw data was properly transformed in a staging server (or ETL pipelining tool) before being loaded into your ridiculously expensive data warehouse. The volume of data that was handled by such tools back then was relatively small, and thus manageable for most staging servers to handle.
But the ETL approach has a number of drawbacks when viewed through a more contemporary lens:
- Every new set of transformations required involvement from IT or from data engineering to code up. After all, the ETL tools of that era were hardly accessible to data analysts, who traditionally came from a SQL background. As a result, data analysts relied on data engineering for access to newly transformed data, and often had to wait days before they could implement new reports.
- As data sizes increased, the ETL approach became more and more problematic. Specifically, the staging server — that is, the machine that orchestrated all the loading and transforming of data — began to be a bottleneck for the rest of the stack.
So what changed? Well, a couple of things emerged in the 2010s that made an alternative approach possible:
- First, we saw the commoditization of the cloud data warehouse. Modern data warehouses today can store and process a very large amount of data at very little cost.
- We also saw an explosion in the amount and in the variety of data being collected. Some of us knew this change as the ‘big data revolution’, a fad of the mid-2010s. The end result of that fad, however, was good for all of us: it pushed the development of new tools and new approaches to data, all of which were built around the assumption of needing to deal with terabyte-level data volumes at a minimum.
- And finally, we saw the rise of lean and agile software development practices. Such practices meant that people began to expect more from their data departments, the same way that they were used to quick execution speeds in their software development teams.
And so at some point, people began to realize: the cost of storing and processing data had become so cheap, it was now a better idea to just dump all your data into a central location, before applying any transformations.
And thus was planted the seed that grew into the ELT approach.
In contrast to ETL, an ELT approach has a number of advantages:
- It removes the performance bottleneck at the staging server/ETL pipelining tool. This is significant because data warehouses had grown in processing power far beyond even the most advanced ETL pipelining tools. The ELT approach assumes a powerful data warehouse at its core.
- It does not demand detailed planning on what data to transform beforehand. Data practitioners began to take a more agile approach to analytics, aka “dump first, transform later”.
- With proper transform and modeling tools, ELT did not require data engineers to be on standby to support any transformation request from the analytics team. This empowered data analysts, and increased execution speed.
As stated in 1.4 Our biases of a good analytics stack, we favor ELT over ETL, and we believe that all modern analytics stacks should be set up this way.
Below is a short table summarizing the differences between ETL and ELT.

| | ETL | ELT |
|---|---|---|
| **Historical context** | Data warehouses are very expensive (millions of dollars). Data volumes are still manageable. Teams are forced to practice waterfall development. | Cloud data warehouses drive the cost of storing and processing data down significantly (hundreds/thousands of dollars only). Data volumes explode. Agile practices are possible. |
| **Process** | Raw data is transformed in a staging server. Only transformed data is loaded into the data warehouse. Transformations rely on the staging server’s processing power. | Raw data is loaded into the data warehouse. Transformations are done within the data warehouse, and results are also stored there. Transformations rely on the data warehouse’s processing power. |
| **Pros & cons** | The data warehouse contains only cleaned, transformed data ⇒ maximizes utilization of the warehouse. Doesn’t work well when data volumes increase ⇒ bottlenecks on the staging server. Process changes usually take weeks/months due to the waterfall approach. | All data is stored in the cloud data warehouse ⇒ very easy to introduce changes. Doesn’t need additional staging servers. Assuming a modern data warehouse, works well as data volumes increase. Transforming or introducing new data takes only days. |
What About Data Lakes?
At this point, it’s worth asking: what about data lakes? How do they fit into the ELT vs ETL paradigm that we’ve just discussed?
Let’s back up a moment.
A data lake is a fancy term for a central staging area for raw data. The idea is to have everything in your organization dumped into a central lake, before loading it into your data warehouse. Unlike data warehouses (which we have talked about extensively in our discussion about ELT, above) lakes are often object buckets in which you may upload all manner of unstructured data: examples of buckets are services like AWS S3 or Google Cloud Storage; examples of unstructured data are CSV dumps or even text files, exported from various source systems.
It is important to understand that a data lake is not a new idea. Since the 80s, business intelligence projects have usually included a staging area for data. ETL systems would then take data from that staging area and transform it within the tool, before loading it into data warehouses. The only thing that is new here is the term itself — and that term harkens back to a 2010 blog post by Pentaho CTO James Dixon.
We have no strong feelings about data lakes. The point of ELT is to load the unstructured data into your data warehouse first, and then transform it within, rather than transforming data in-flight through a pipelining tool. Whether this raw data sits in an object bucket before loading is of little concern to us.
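To make that point concrete, here is a minimal sketch of ELT with a lake in front of the warehouse. Everything here is hypothetical: a plain Python dict of CSV blobs stands in for an object bucket like S3 or Google Cloud Storage, the file name and columns are made up, and an in-memory SQLite database stands in for the warehouse.

```python
import csv
import io
import sqlite3

# A data lake is just a staging area for raw files; a dict of CSV blobs
# stands in for an object bucket here (hypothetical path and data).
lake = {
    "exports/orders.csv": "id,amount,country\n1,19.99,us\n2,5.00,SG\n",
}

# ELT with a lake: load each raw file into the warehouse as-is...
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT, country TEXT)")
for blob in lake.values():
    rows = list(csv.DictReader(io.StringIO(blob)))
    warehouse.executemany(
        "INSERT INTO raw_orders VALUES (:id, :amount, :country)", rows
    )

# ...then transform inside the warehouse, exactly as in plain ELT.
warehouse.execute("""
    CREATE TABLE orders AS
    SELECT CAST(id AS INTEGER) AS id,
           CAST(amount AS REAL) AS amount,
           UPPER(country) AS country
    FROM raw_orders
""")
```

Whether the raw CSVs live in a local dict, an S3 bucket, or a staging directory changes nothing about the ELT pattern itself: the warehouse still receives raw data and performs the transformation.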
To conclude, when you are picking analytics tools, ask yourself: does this tool assume an ETL approach, or does it assume an ELT approach? Anything that requires a data transformation step outside the data warehouse should set off alarm bells in your head; it means that it was built for the past, not the future.
Pick ELT. As we will soon see in Chapter 3, ELT unlocks a lot more than just the operational advantages we’ve talked about above.