Understanding The Data Warehouse
In the previous section we spoke about the process of consolidating (Extract & Load) data from multiple source systems into your analytics database. In this post, we’ll talk specifically about your analytics database, i.e your data warehouse.
What is a data warehouse?
A data warehouse is a type of analytics database that stores and processes your data for the purpose of analytics. Your data warehouse will handle two main functions of your analytics: store your analytical data & process your analytical data.
Why do you need one? You will need a data warehouse for two main purposes:
- First, you can’t combine data from multiple business functions easily if they sit in different sources.
- Second, your source systems are not designed to run heavy analytics, and doing so might jeopardize your business operations as it increases the load on those systems.
Your data warehouse is the centerpiece of every step of your analytics pipeline process, and it serves three main purposes:
- Storage: In the consolidate (Extract & Load) step, your data warehouse will receive and store data coming from multiple sources.
- Process: In the process (Transform & Model) step, your data warehouse will handle most (if not all) of the intensive processing generated from the transform step.
- Access: In the reporting (Visualize & Delivery) step, reports are being gathered within the data-warehouse first, then visualized and delivered to end-users.
At the moment, most data warehouses use SQL as their primary querying language.
When is the right time to get a data warehouse?
The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.
At an early stage, you can probably get by without a data warehouse, and connect a business intelligence (BI) tool directly to your production database (As we’ve mentioned in 1.2 A simple setup for people just starting out).
However, if you are still not sure if a data warehouse is the right thing for your company, consider the below pointers:
First, do you need to analyze data from different sources?
At some point in your company’s life, you would need to combine data from different internal tools in order to make better, more informed business decisions.
For instance, if you’re a restaurant and want to analyze orders/waitress ratio efficiency (which hour of the week the staff is most busy vs most free), you need to combine your sales data (from POS system) with your staff duty data (from HR system).
For those analyses, it is a lot easier to do if your data is located in one central location.
Second, do you need to separate your analytical data from your transactional data?
As mentioned, your transactional systems are not designed for analytical purposes. So if you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly.
Instead, it’s a much better idea to purchase a data warehouse — one that's designed for complex querying — and transfer the analytical data there instead. That way, the performance of your app isn’t affected by your analytics work.
Third, is your original data source not suitable for querying?
For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse, in order for data analysts to work effectively with it.
Fourth, do you want to increase the performance of your analytical queries?
If your transactional data consists of hundreds of thousands of rows, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Not doing so will cause queries to be incredibly slow — not to mention having them being an unnecessary burden on your database.
If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.
That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.
Which Data Warehouse Should I Pick?
Here are some common data warehouses that you may pick from:
- Amazon Redshift
- Google BigQuery
- ClickHouse (self-hosted)
- Presto (self-hosted)
If you’re just getting started and don’t have a strong preference, we suggest that you go with Google BigQuery for the following reasons:
- BigQuery is free for the first 10GB storage and first 1TB of queries. After that it’s pay-per-usage.
- BigQuery is fully managed (serverless): There is no physical (or virtual) server to spin up or manage.
- As a result of its architecture, BigQuery auto-scales: BigQuery will automatically determine the right amount of computing resources to allocate to each query, depending on the query’s complexity and the amount of data you scan, without you having to manually fine-tune it.
(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).
However, if you have a rapidly increasing volume of data, or if you have complex/special use cases, you will need to carefully evaluate your options.
Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.
But it is, we think, a good start:
|Pricing & Delivery
|Amazon, as part of AWS offering
|Pay per instance. Starts at $0.25/hr (~$180/month)
|Google, as part of Google Cloud offering
|Pay per data queried & stored. Starts at $0, free first 10GB storage & 1TB queried.
|Developed in-house at Yandex, later open-sourced.
|Free & Open-source. Deploy on your own server.
|Pay per usage. Check website for more info. Cloud-based (on AWS, GCP or Azure).
|Developed in-house at Facebook, later open-sourced. Now managed by Presto Foundation (part of Linux Foundation).
|Free & open source. Deploy on your own server.
What makes a data warehouse different from normal SQL database?
At this point some of you might be asking:
“Hey isn’t a data warehouse just like a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”
The short answer is: yes you can.
The long answer is: it depends. First, we need to understand a few concepts.
Transactional Workloads vs Analytical Workloads
It is important to understand the difference between two kinds of database workloads: transactional workloads and analytical workloads.
A transactional workload is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the database to fetch this product, and return the result to the application for processing.
SELECT * FROM products WHERE id = 123
(the query above retrieves information for a single product with ID 123)
Here are several common attributes of transactional workloads:
- Each query usually retrieves a single record, or a small amount of records (e.g. get the first 10 blog posts in a category)
- Transactional workloads typically involve simple queries that take a very short time to run (less than 1 second)
- Lots of concurrent queries at any point in time, limited by the number of concurrent visitors of the application. For big websites this can go to the thousands or hundreds of thousands.
- Usually interested in the whole data record (e.g. every column in the product table).
Analytical workloads, on the other hand, refer to workload for analytical purposes, the kind of workload that this book talks about. When a data report is run, a query will be sent to DB to calculate the results, and then displayed to end-users.
count(*) as num_products
GROUP BY 1
(The above query scans the entire products table to count how many products are there in each category)
Analytical workloads, on the other hand, have the following attributes:
- Each query typically scans a large number of rows in the table.
- Each query is heavy and takes a long time (minutes, or even hours) to finish
- Not a lot of concurrent queries happen, limited by the amount of reports or internal staff members using the analytics system.
- Usually interested in just a few columns of data.
Below is a comparison table between transactional vs analytical workload/databases.
Transactional workloads have many simple queries, whereas analytical workloads have few heavy queries.
The Backend for Analytics Databases is Different
Because of the drastic difference between the two workloads above, the underlying backend design of the database for the two workloads are very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.
What are the differences in architecture you ask? This will take a dedicated section to explain, but the gist of it is that analytical databases use the following techniques to guarantee superior performance:
- Columnar storage engine: Instead of storing data row by row on disk, analytical databases group columns of data together and store them.
- Compression of columnar data: Data within each column is compressed for smaller storage and faster retrieval.
- Parallelization of query executions: Modern analytical databases are typically run on top of thousands of machines. Each analytical query can thus be split into multiple smaller queries to be executed in parallel amongst those machines (divide and conquer strategy)
As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.
So, can I use a normal SQL database as my data warehouse?
Like we’ve said earlier, yes you can, but it depends.
If you’re just starting out with small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (most popular ones are MySQL, PostgreSQL, MSSQL or Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuring.
That said, with the advent of low-cost data warehouse like BigQuery, Redshift above, we would recommend you go ahead with a data warehouse.
However, if you must choose a normal SQL-based database (for example your business only allows you to host it on-premise, within your own network) we recommend going with PostgreSQL as it has the most features supported for analytics. We’ve also written a detailed blog post discussing this topic here: Why you should use PostgreSQL over MySQL for analytics purpose.
In this section, we zoomed in into data warehouse and spoke about:
- Data warehouse is the central analytics database that stores & processes your data for analytics
- The 4 trigger points when you should get a data warehouse
- A simple list of data warehouse technologies you can choose from
- How a data warehouse is optimized for analytical workload vs traditional database for transactional workload.