
Consolidating Data From Multiple Sources

In a typical organization, data sits in many different systems. Each of these fractured sources holds only a small piece of the whole puzzle. To see the big picture, you need to bring data from all of them into one centralized place.

In this chapter, we will walk you through the basics of the data consolidation process.


The different sources of data

Let’s start by talking about the different sources of data that a business might generate.


In a typical organization, it is common to have these three types of data sources:

1- Data coming directly out of your main application

If you are a company that does most of its business through a website or application (like Facebook, for instance) then this is the data that exists within your main app. Such application databases typically contain all the transactional data that is critical to your business operations.

For example:

  • An eCommerce store must have a database containing customers, products, orders, and so on.
  • A ride-hailing app must have a database containing customers, drivers, cars, and booking information.

This data is usually stored in an SQL (or NoSQL!) database that is directly connected to your application. To access this data, you usually connect to it via a Database Query Tool.
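
For illustration, here is a minimal Python sketch of what querying such an application database might look like. It assumes a PostgreSQL database and the psycopg2 client library; the connection details and table are made up.

import psycopg2  # PostgreSQL client library

# Hypothetical connection details -- use read-only credentials where possible
conn = psycopg2.connect(
    host="app-db.internal",
    dbname="app",
    user="readonly",
    password="...",
)

with conn.cursor() as cur:
    # Pull a sample of orders straight out of the application database
    cur.execute("SELECT id, customer_id, total, created_at FROM orders LIMIT 100")
    for row in cur.fetchall():
        print(row)

conn.close()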

2- Data coming from Customer Relationship Management (CRM), Content Management System (CMS), Enterprise Resource Planning (ERP), or marketing systems

These systems are third-party applications that are used in your organization’s operations. For instance:

  • CRM tools: Pipedrive, Salesforce, Zendesk
  • Analytics: Google Analytics, Firebase, Amplitude
  • Marketing: Facebook Ads, Google Ads, DoubleClick

These tools are often managed by a third party, and you usually do not have direct access to the data stored within them. You’ll have to get your data via an API provided by the vendor and store it in your own data store.
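
To make that concrete, here is a rough sketch of pulling records from a hypothetical third-party API with Python and storing them locally. The endpoint, token, and field names are invented for illustration; a real integration follows whatever the vendor’s API documentation specifies.

import csv
import requests  # HTTP client for calling the vendor's API

# Hypothetical endpoint and token -- replace with the vendor's real API details
API_URL = "https://api.example-crm.com/v1/deals"
API_TOKEN = "..."

response = requests.get(
    API_URL,
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    params={"updated_since": "2020-01-01"},
)
response.raise_for_status()
deals = response.json()  # assume the API returns a JSON list of deal records

# Store the extracted records in your own data store (a CSV file here, for simplicity)
with open("deals.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "amount", "stage", "updated_at"])
    writer.writeheader()
    for deal in deals:
        writer.writerow({key: deal.get(key) for key in writer.fieldnames})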

3- Manual data created by employees and other systems

The third and last category is data that is created manually by employees in your organization.

This typically includes formats like Excel spreadsheets and Google Sheets, but may sometimes come in the form of CSV files, text documents, or (God forbid) PDF reports.

Since this data is created by humans without enforced structure and rules, it is usually the most prone to error.
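
When you do load this kind of data, it pays to validate it before it reaches your reports. Below is a small sketch of a few basic checks using pandas, assuming a hypothetical expenses spreadsheet exported to CSV.

import pandas as pd

# Hypothetical export of a manually maintained spreadsheet
df = pd.read_csv("expenses.csv")

# Check that the columns we expect are actually present
required_columns = {"date", "department", "amount"}
missing = required_columns - set(df.columns)
if missing:
    raise ValueError(f"Spreadsheet is missing columns: {missing}")

# Flag values a human is likely to have entered incorrectly
bad_dates = pd.to_datetime(df["date"], errors="coerce").isna()
bad_amounts = pd.to_numeric(df["amount"], errors="coerce").isna()
print(f"{bad_dates.sum()} rows with unparseable dates")
print(f"{bad_amounts.sum()} rows with non-numeric amounts")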

The central datastore for analytics

The above systems are places where you store & transact data, not where you run analysis. For that purpose, you will need a data warehouse.

A data warehouse is a central database used to store and process large amounts of data for analytical purposes. You may think of it as the place where you dump a copy of all your data from the other source systems.

You will need a data warehouse for two main purposes:

  1. First, you can’t easily combine data from multiple business functions if it sits in different sources.
  2. Second, your source systems are not designed to run heavy analytics; doing so increases the load on those systems and might jeopardize your business operations.

A data warehouse is nothing more than a database that is optimized for analytical processing. In the next section, we’ll dive deeper into the data warehouse, but for now, let’s stay focused on the data loading process.

The data loading process is simply the work of extracting data from multiple sources and loading it into your data warehouse. This process is also called Extract & Load. Let’s look at it now.

What does the Extract & Load process look like?

Extract & Load (EL) is quite a straightforward concept: a program is written to extract raw data from a data source, and that same data is then copied (loaded) over to a destination.

For example, the following pseudo-code loads booking data from a source (a MySQL application database) and copies it into an SQL data warehouse. It involves three steps:

  • Run a query to extract all data from the source table bookings.
  • Create the destination table in the data warehouse based on a specific structure.
  • Load the data into the destination data warehouse.

source_db = connect_to_source_db();
dest_db = connect_to_dest_datawarehouse();

# Extract step
source_records = source_db.query("
  SELECT id, email, user_id, listing_id, created_at
  FROM bookings
");

# Create the destination table if it does not exist
dest_db.query("
  CREATE TABLE IF NOT EXISTS dw.bookings (
    id integer, email varchar, user_id integer,
    listing_id integer, created_at datetime
  );
");

# Load step
for record in source_records {
  dest_db.query("
    INSERT INTO dw.bookings (id, email, user_id, listing_id, created_at)
    VALUES ($1, $2, $3, $4, $5)
  ", record);
}

(This is a crude implementation to illustrate a naive loading process)

The above script is relatively simple since it queries directly from an SQL database, and it doesn’t handle things like failure, smart retries, and so on.

You then schedule the above script to run on a daily basis by creating a ‘cron job’ on a server, as follows:

00 1 * * * /path/to/script

This line basically says: run this script every day at 1 AM. (Note: a cron job is how Linux servers schedule repeating tasks. You create new jobs by editing a file in the operating system called a ‘crontab’. Each job is a line in that file, which looks like the line above.)
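
For reference, the five fields before the command are, in order: minute, hour, day of month, month, and day of week; an asterisk means ‘every value’. Annotated, that line reads:

# minute  hour  day-of-month  month  day-of-week  command
00        1     *             *      *            /path/to/script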

In actual production, your script will be a lot more complicated, since there are other considerations like performance optimization, failure handling, source system interfaces, and incremental loading. Over time, the human cost of maintaining your scripts will far outweigh the actual value they bring you.
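
To give a feel for just one of those considerations, here is a minimal sketch of the kind of retry logic a production loader might wrap around each step; the extract and load functions referenced in the comments are hypothetical placeholders.

import time

def with_retries(task, max_attempts=3, backoff_seconds=60):
    """Run a task, retrying with a fixed backoff if it raises an exception."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:
            if attempt == max_attempts:
                raise  # give up and let the scheduler / alerting pick it up
            print(f"Attempt {attempt} failed ({error}); retrying in {backoff_seconds}s")
            time.sleep(backoff_seconds)

# Hypothetical usage, wrapping the extract and load steps:
# records = with_retries(lambda: extract_bookings(source_db))
# with_retries(lambda: load_bookings(dest_db, records))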

That’s why it is usually better to consider adopting an existing data load tool instead. The only exception here is if your business has specialized needs, or if you operate at the scale of big tech companies like Netflix, Google and Facebook.

Using a Data Loading Tool

The good news is that there are a large number of free and paid data loading tools on the market. These tools often behave in a plug-and-play manner: they provide a user-friendly interface to connect your data sources to your data warehouse, let you set loading intervals and load modes, and usually also report on the status of your data loading jobs.

These data loading tools are commonly known as ETL tools, a term you will likely hear a lot. We’ll discuss it in detail in a subsequent section.

Paid tools like StitchData or Talend often boast a large selection of data integrations, ranging from web analytics platforms and Software-as-a-Service applications to SQL and NoSQL databases.


Basic data loading capabilities are also often bundled in data analytics platforms like Holistics, Google Data Studio, and Tableau. Though their number of integrations is often not as extensive as that of dedicated tools, it is sometimes enough for basic reporting needs.


You can also go for open-source software, though you will likely need to spend some time setting things up and integrating them into your systems.

One recent example in this category that we find particularly interesting is Meltano, a platform that bundles many open-source tools together into a full ETL platform.


These are some data load tools on the market for you to consider:

  • Alooma
  • HevoData
  • Holistics
  • Meltano
  • StitchData
  • Talend
  • Pentaho

If you still need to be convinced that writing your own data loading & ETL scripts is a bad idea, check out this great article by Jeff Magnusson, who wrote it in his capacity as VP of Data Platform at Stitch Fix.

Common concepts to take note of

Loading Data Incrementally

One common concept worth mentioning is incremental load: the notion of loading data (you guessed it!) incrementally into your data warehouse. As your source systems grow in volume, incremental loading becomes more important to ensure your system runs smoothly.

Let’s revisit the earlier example of loading bookings data, but this time, let’s look at how to run this load incrementally.


When an incremental loading job runs, only the newly created data (say, the records for 2020-01-04) is queried and copied over to the destination table.

In our script, there’s an additional step we have to write in order to grab the most recently created bookings.

source_db = connect_to_source_db();
dest_db = connect_to_dest_datawarehouse();

max_value = dest_db.query("SELECT max(created_at) FROM dw.bookings")

# Extract step - this time we only extract recent records
source_records = source_db.query("
  SELECT id, email, user_id, listing_id, created_at
  FROM bookings
  WHERE created_at > $1
", max_value);

# the rest of the load step happens normally.
...
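
One detail worth handling: on the very first run, the destination table dw.bookings is empty, so max(created_at) returns NULL and the WHERE created_at > $1 filter matches no rows. A common fix, sketched here in the same pseudo-code style (the fallback date is arbitrary), is to substitute a value older than any record before running the extract query:

# On the first run the destination table is empty, so fall back to a very old date
if max_value is None:
    max_value = "1970-01-01 00:00:00"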

Most industry-standard data loading tools support incremental load.

How much performance gain does this approach get you? A lot.

Imagine that you have 100M booking records and that those records are growing at a pace of 10,000 records a day:

  • With incremental load: you copy 10,000 records each day
  • Without incremental load: you process 100M records (and more as time goes by!) each day.

This is a 10,000 times difference in load.

How often should I perform data loading?

Based on our experience, most businesses just need a daily refresh of data.

It’s important to note that, unless your use case absolutely requires it, real-time data is not necessary for most business analytics.

To understand why, think about this: if you want to view sales data over the last seven weeks, is it necessary for the data to be accurate up to the minute you’re requesting it?

Most business use cases just need a daily refresh of analytics data.

The most common setup we see is a pipeline that runs after midnight and finishes before people get to work, so that when business users log in in the morning, all of their analytics reports are refreshed with yesterday’s data.
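
In cron terms, such a nightly pipeline is scheduled just like the script we saw earlier; for example, the (hypothetical) line below kicks the pipeline off at 12:30 AM every day:

30 0 * * * /path/to/pipeline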

Summary

  • There are three different types of source data systems: application data, third-party data, and manual data.
  • To store and process data for analytics, you need a data warehouse.
  • The process to move data from source to destination is called Extract & Load.
  • We went over what the Extract & Load process looks like in practice, and recommended that you use off-the-shelf tools. We also talked about how incremental load can help you improve the performance of your EL process.

In the next section, we talk about the next logical piece of the puzzle: the Data Warehouse.