Consolidating data from source systems

In a typical organization, data sits in many different systems. Each of these fragmented sources provides only a small piece of the whole puzzle. To see the big picture, you need to bring data from all of these different sources into one centralized place.

In this chapter, we will walk you through the basics of the data consolidation process.


Different types of source data

Let’s start by talking about the different sources of data that a business might generate.

Since this book focuses on the technical aspects of building the analytics stack, the categorization below looks at these sources from a technical point of view.


In a typical organization, it is common to have these three types of data sources:

1- Data coming directly out of your main application (application database)

If you are a company that does most of its business through a website or application (like Amazon, for instance), this is the data that exists within your main app. Such application databases typically contain all the transactional data that is critical to your business operations.

For example:

  • An eCommerce store must have a database containing customers, products, orders and so on.
  • A ride-hailing app must have a database containing customers, drivers, cars, and booking information.

This data is usually stored in an SQL (or NoSQL!) database that is directly connected to your application. To access this data, you usually connect to it via a Database Query Tool.
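
For instance, querying an application database directly with a small script might look like the sketch below. The psycopg2 driver, the connection details, and the orders table are assumptions for illustration only:

import psycopg2

# Connect to the application's database (credentials here are placeholders)
conn = psycopg2.connect(
  host="app-db.internal", dbname="ecommerce", user="readonly", password="secret"
)
cur = conn.cursor()

# Pull a few recent orders to inspect the transactional data
cur.execute(
  "SELECT id, customer_id, total_amount, created_at "
  "FROM orders ORDER BY created_at DESC LIMIT 10"
)
for row in cur.fetchall():
  print(row)

cur.close()
conn.close()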

2- Data coming from Customer Relationship Management (CRM), Content Management System (CMS), Enterprise Resource Planning (ERP), or marketing systems

These systems are third-party applications that are used in your organization’s operations. For instance:

  • CRM tools: Pipedrive, Salesforce, Zendesk
  • Analytics: Google Analytics, Firebase, Amplitude
  • Marketing: Facebook Ads, Google Ads, DoubleClick

These tools are often managed by a third party. Usually, you do not have direct access to the data stored within them. You need to get the data out via an API the vendor provides, and store it in your own data store.
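
As a rough illustration, pulling data out of such a system usually means paginating through the vendor’s REST API. The endpoint, authentication scheme, and field names below are hypothetical; every vendor’s API is different:

import requests

API_URL = "https://api.example-crm.com/v1/contacts"  # hypothetical endpoint
API_KEY = "your-api-key"

contacts = []
page = 1
while True:
  # Fetch one page of contacts from the vendor's API
  resp = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"page": page, "per_page": 100},
  )
  resp.raise_for_status()
  batch = resp.json()["results"]
  if not batch:
    break
  contacts.extend(batch)
  page += 1

# The records in `contacts` can then be written to your own data store
print(f"Fetched {len(contacts)} contacts")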

3- Manual data created by employees and other systems

The third and last category is data that is created manually by employees in your organization.

This typically includes formats like Excel spreadsheets and Google Sheets, but may sometimes come in the form of CSV files, text documents, or (God forbid) PDF reports.

Since this data is created by humans without enforced structures and rules, it is usually the most error-prone.
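
A quick sketch of what ingesting such a file might look like, assuming a CSV export and the pandas library (the file name and columns are made up):

import pandas as pd

# Read a manually maintained CSV export (e.g. downloaded from Google Sheets)
budgets = pd.read_csv("marketing_budget_2020.csv")

# Human-entered data usually needs some basic cleaning before it can be trusted
budgets.columns = [c.strip().lower().replace(" ", "_") for c in budgets.columns]
budgets["amount"] = pd.to_numeric(budgets["amount"], errors="coerce")  # bad values become NaN

print(budgets.head())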

The central datastore for analytics

The above systems are places where you store & transact data, not where you run analysis. For that purpose, you will need a data warehouse.

A data warehouse is a central database to store & process a large amount of data for analytical purposes. You may think of it as the place where you dump a copy of all your data from the other source systems.

A data warehouse is nothing more than a database that is optimized for analytical processing. We’ll dive deep on the data warehouse in the next section, but for now, let’s stay focused on the data loading process.

The data loading process is the work of extracting data from multiple sources and loading it into your data warehouse. This process is also called Extract & Load. Let’s look at that now.

The Extract & Load process

Extract & Load (EL) is quite a straightforward concept: a program is written to extract raw data from a data source, and that same data will be copied (loaded) over to a destination.

For example, the following pseudo-code loads booking data from a source system (a MySQL application database) and copies it into an SQL data warehouse. It involves three steps:

  • Run queries to extract all data from source table bookings
  • Create the destination database table based on a specific structure.
  • Load the data into the destination data warehouse.
source_db = connect_to_source_db()
dest_db = connect_to_dest_datawarehouse()

# Extract step: pull every record from the source bookings table
source_records = source_db.query(
  "SELECT id, email, user_id, listing_id, created_at FROM bookings"
)

# Create the destination table if it does not exist yet
dest_db.query("""
  CREATE TABLE IF NOT EXISTS dw.bookings (
    id integer,
    email varchar,
    user_id integer,
    listing_id integer,
    created_at datetime
  );
""")

# Load step: insert each extracted record into the warehouse table
for record in source_records:
  dest_db.query(
    "INSERT INTO dw.bookings (id, email, user_id, listing_id, created_at) "
    "VALUES ($1, $2, $3, $4, $5)",
    record
  )

(This is a crude implementation to illustrate a naive loading process)

The above script is relatively simple since it queries directly from an SQL database, and it doesn’t handle things like failures, smart retries, and so on. The script can then be set up to run every day in the early morning.

In actual production, your script will be a lot more complicated, since there are also other considerations like performance optimization, failure handling, source system interfaces and incremental loading.
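
To give a sense of what just one of these concerns (failure handling) involves, here is a minimal sketch of retrying a flaky extraction query with exponential backoff. The retry policy and the query interface are assumptions, not a prescription:

import time

def query_with_retry(db, sql, max_attempts=3, base_delay=5):
  # Run a query, retrying with exponential backoff if it fails
  for attempt in range(1, max_attempts + 1):
    try:
      return db.query(sql)
    except Exception as exc:
      if attempt == max_attempts:
        raise  # out of retries: fail loudly so the scheduler can alert someone
      wait = base_delay * (2 ** (attempt - 1))
      print(f"Query failed ({exc}); retrying in {wait}s (attempt {attempt}/{max_attempts})")
      time.sleep(wait)

# Usage in the load script above:
# source_records = query_with_retry(
#   source_db, "SELECT id, email, user_id, listing_id, created_at FROM bookings"
# )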

Over time, the human cost of maintaining your scripts will far outweigh the actual value they bring you.

That’s why it is usually better to consider adopting an existing data load tool instead. The only exception here is if your business has specialized needs, or if you operate at the scale of big tech companies like Netflix, Google and Facebook.

Using a Data Loading Tool

The good news is that there are a large number of free and paid data loading tools on the market. These tools often behave in a plug-and-play manner: they provide a user-friendly interface to connect your data sources to your data storage, let you set loading intervals and load modes, and typically also report on the status of your data loading jobs.

These data load tools are commonly known as ETL tools. This might confuse beginner readers, though, since most of the modern tools we look at don’t handle the Transform step and instead ask you to use a dedicated transformation tool.

Paid tools like StitchData or Talend often boast a large selection of data integrations, ranging from web analytics platforms and Software-as-a-Service applications to NoSQL and SQL databases.


That isn’t to say that you need such sophisticated tools all the time, though. Basic data loading capabilities are also often bundled into data analytics platforms like Holistics, Google Data Studio, and Tableau. Though their integrations are often not as extensive as those of dedicated tools, they are sometimes enough for basic reporting needs.


If you don’t want one of the more comprehensive paid options, you may also choose open-source software. Particularly well-known packages include Airflow and Prefect. As with all open-source software, you will likely need to spend some time setting things up and integrating such tools into your systems.

A recent example of this category that we find particularly interesting is Meltano — a platform by GitLab that focuses on providing open-source data pipelines.


It will be interesting to see if Meltano can galvanize an open-source community around it. If they succeed, we would have open-source data loaders for just about every service under the sun.

Anyway, let’s wrap up. On the proprietary side, tools like StitchData and Talend (mentioned above) are solid options to consider. On the open-source side (though, as with all things open source, caveat emptor), Airflow, Prefect, and Meltano are good places to start.

If you still need to be convinced that writing your own data load & ETL scripts is a bad idea, check out this great article by Jeff Magnusson, who wrote it in his capacity as the VP of Data Platform at Stitch Fix.

Common Concepts

How do I load data incrementally?

One common concept worth mentioning is incremental load: loading data (you guessed it!) incrementally into your data warehouse. As your source systems grow in volume, incremental loading becomes more important to keep your system running smoothly.

Let’s revisit the earlier example of loading bookings data, but this time, let’s look at how to run this load incrementally.

(Figure: incrementally loading bookings data, where only the newest day of records is copied over)

We can see that when an incremental loading job runs, only the data for 2020-01-04 will be queried and copied over to the destination table.

To reflect this in the above pseudo-code, there’s an additional step we have to write in order to grab the most recently created bookings.

source_db = connect_to_source_db()
dest_db = connect_to_dest_datawarehouse()

# Find the timestamp of the newest record already in the warehouse.
# (On the very first run dw.bookings is empty and this returns NULL,
# so a real implementation would fall back to a full load.)
max_value = dest_db.query("SELECT max(created_at) FROM dw.bookings")

# Extract step - this time we only extract records newer than max_value
source_records = source_db.query("""
  SELECT id, email, user_id, listing_id, created_at
  FROM bookings
  WHERE created_at > $1
""", max_value)

# the rest of the load step happens normally.
...

Most industry-standard data load tools should have support for incremental load.

How much performance gain does incremental load get you? A lot.

Imagine that you have 100M booking records and that those records are growing at a pace of 10,000 records a day:

  • With incremental load: you copy 10,000 records each day
  • Without incremental load: you process 100M records (and more as time goes by!) each day.

This is a 10,000 times difference in load (100,000,000 records ÷ 10,000 records per day = 10,000).

How often should I perform data loading?

Based on our own experience, most analytics use cases in most businesses just need a daily refresh of data.

It’s also worth noting that unless your use case absolutely requires it, real-time data is rarely necessary for business analytics. To understand why, think about this: if you want to view sales data over the last seven weeks, is it really necessary for the data to be accurate up to the minute you’re requesting it?

Most business use cases just need a daily refresh of analytics data.

A common setup that we see is a pipeline that runs after midnight and finishes before people get to work, so that when business users log in in the morning, all of their analytics reports are refreshed with yesterday’s data.
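
To make this concrete, here is a minimal sketch of what such a nightly schedule could look like with Airflow, one of the open-source tools mentioned earlier. The DAG name, the 1 AM schedule, and the load function are illustrative assumptions, not a recommended setup:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_bookings():
  # Placeholder for the Extract & Load logic sketched earlier in this chapter
  pass

# Run every day at 1 AM so reports are refreshed before business users log in
with DAG(
  dag_id="nightly_bookings_load",
  start_date=datetime(2020, 1, 1),
  schedule_interval="0 1 * * *",
  catchup=False,
) as dag:
  PythonOperator(task_id="load_bookings", python_callable=load_bookings)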

Summary

  • There are three different types of source data systems: application data, third-party data and manual data.
  • To store and process data for analytics, you need a thing called a data warehouse.
  • The process to move data from source to destination is called Extract & Load.
  • We went over what the Extract & Load process looks like in practice, and recommended that you use off-the-shelf tools. We also discussed how incremental load can improve the performance of your EL process.

In the next section, 2.2 Understanding The Data Warehouse, we talk about the next logical piece of the puzzle.