fb

Data Warehouse

In the previous section we spoke about the process of consolidating (Extract & Load) data from multiple source systems into your analytics database. In this post, we’ll talk specifically about your analytics database, i.e your data warehouse.

What is a data warehouse?

A data warehouse is a type of analytics database that stores and process your data for the purpose of analytics. Your data warehouse will handle two main functions of your analytics: store your analytical data & process your analytical data.

Why do you need one? Your data warehouse is the centerpiece of every step of your analytics pipeline process, and it serves three main purposes:

  • Storage: In the consolidate (Extract & Load) step, your data warehouse will receive and store data coming from multiple sources.
  • Process: In the process (Transform & Model) step, your data warehouse will handle most (if not all) of the intensive processing generated from the transform step.
  • Access: In the reporting (Visualize & Delivery) step, reports are being gathered within the data-warehouse first, then visualized and delivered to end users.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/e2cb90d7-32ed-407b-82e2-1511db1740c4/5BCF2E8E-CEB3-4037-95D6-42FF3105014C.jpeg

At the moment, most data warehouses use SQL as their primary querying language.

When should I get a data warehouse?

The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.

At an early stage, you can probably get by without a data warehouse, and connect a business intelligence (BI) tool directly to your production database (As we’ve mentioned in A simple setup for people just starting out).

However, if you are still not sure if a data warehouse is the right thing for your company, consider the below pointers:

  • First, do you need to analyse data from different sources?

    At some point in your company’s life, you would need to combine data from different internal tools in order to make better, more informed business decisions.

    For instance, you might want to track your most valuable customers on a weekly basis — which requires you to combine payment information from your credit card processor, financial information from your accounting system, and the activity data your customers generate within your product. This is a lot easier to do if your data is located in one central location than if you were to go to three separate places for analysis.

  • Second, do you need to separate your analytical data from your transactional data?

    If you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly. Instead, it’s a much better idea to purchase a data warehouse — one that's designed for complex querying — and transfer the analytical data there instead. That way, the performance of your app isn’t affected by your analytics work.

  • Third, is your original data source not suitable for querying?

    For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse, in order for data analysts to work effectively with it.

  • Fourth, do you want to increase the performance of your analytical queries?

    If your transactional data consists of hundreds of thousands of rows, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Not doing so will cause queries to be incredibly slow — not to mention having them being an unnecessary burden on your database.

https://www.holistics.io/blog/content/images/2019/08/Holistics-Image-Template.001-2-1.png

If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.

That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.

Which Data Warehouse Should I Pick?

Here are some common data warehouses that you may pick from:

  • Amazon Redshift
  • Google BigQuery
  • Snowflake
  • ClickHouse (self-hosted)
  • Presto (self-hosted)

If you’re just getting started and don’t have a strong preferences, we suggest that you go with Google BigQuery for the following reasons:

  • BigQuery is free for the first 10GB storage and first 1TB of queries. After that it’s pay-per-usage.
  • BigQuery is fully managed (serverless): There is no physical (or virtual) server to spin up or manage.
  • As a result of its architecture, BigQuery auto-scales: BigQuery will automatically determine the right amount of computing resources to allocate to each query, depending on the query’s complexity and the amount of data you scan, without you having to manually fine-tune it.

(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).

However, if you have lots of data with a rapidly increasing volume, or complex or special use cases, you will need to carefully evaluate your options.

Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.

We just want to run through a brief comparison:

Data Warehouse Brief Comparison

What makes a data warehouse different from normal SQL database?

At this point some of you might be asking:

“Hey isn’t a data warehouse just like a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”

The short answer is: yes you can.

The long answer is: it depends.

Transactional Workloads vs Analytical Workloads

First, it’s important to understand the difference between 2 kinds of database workloads: transactional workloads and analytical workloads.

Transactional workloads is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the DB to fetch this product, and return the result to the application for processing.

SELECT * FROM products WHERE id = 123

(the query above retrieves information for a single product with ID 123)

Here are several common attributes of transactional workloads:

  • Each query usually retrieve a single record, or a small amount of records (e.g. retrieve the first 10 blog posts in a category)
  • Transactional workloads typically involve simple queries that takes a very short time to run (less than 1 second)
  • Lots of concurrent queries at any point in time, limited by the number of concurrent visitors of the application. For big website this can go to thousands or hundred thousands.
  • Usually interested in the whole data record (e.g. every column in the product table).

Analytical workloads, on the other hand, refers to workload for analytical purposes, the kind of workload that this book talks about. When a data report is run, a query will be sent to DB to calculate the results, and then displayed to end users.

SELECT
  category_name,
  count(*) as num_products
FROM products
GROUP BY 1

(The above query scans the entire products table to count how many products are there in each category)

Analytical workloads, on the other hand, have the following attributes:

  • Each query typically scans a large amount of rows in the table.
  • Each query is heavy and takes a long time (minutes, or even hours) to finish
  • Not a lot of concurrent queries happen, limited by the amount of reports or internal staff members using the analytics system.
  • Usually interested in just a few columns of data.

Below is a comparison tables between transactional vs analytical workload/databases.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/f28e73f2-2aea-4737-afe4-91d9c2e2b847/Untitled.png

Transactional workloads have many simple queries, whereas analytical workloads have few heavy queries.

Underlying Backend for Analytics Database is Different

Because of the drastic difference between the two workloads above, the underlying backend design of the database for the two workloads are very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.

What are differences in architecture? you ask. This will take a longer chapter to explain, but the gist of it is that analytical databases use the following techniques to guarantee superior performance:

  • Columnar storage engine: Instead of storing data row by row on disk, analytical databases group columns of data together and store them.
  • Compression of columnar data: Data within each column is compressed for smaller storage and faster retrieval.
  • Parallelization of query executions: Modern analytical databases are typically run on top of thousands of machines. Each analytical query can thus be split into multiple smaller queries to be executed in parallel amongst those machines (divide and conquer strategy)

As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.

So, can I use a normal SQL database as my data warehouse?

Like we’ve said earlier, yes you can, but it depends.

If you’re just starting out with small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (most popular ones are ****MySQL, PostgreSQL, MSSQL or Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuring.

That said, with the advent of low-cost datawarehouse like BigQuery, Redshift above, we would recommend you go ahead with a data warehouse.

However, if you must choose a normal SQL-based database (for example your business only allows you to host it on-premise, within your own network) we recommend going with PostgreSQL as it has the most features supported for analytics.

We’ve written a detailed blog post discussing this topic here: Why you should use PostgreSQL over MySQL for analytics purpose.