In the previous section we spoke about the process of consolidating (Extract & Load) data from multiple source systems into your analytics database. In this post, we’ll talk specifically about your analytics database, i.e your data warehouse.
A data warehouse is a type of analytics database that stores and process your data for the purpose of analytics. Your data warehouse will handle two main functions of your analytics: store your analytical data & process your analytical data.
Why do you need one? Your data warehouse is the centerpiece of every step of your analytics pipeline process, and it serves three main purposes:
At the moment, most data warehouses use SQL as their primary querying language.
The TL;DR answer is that it depends. It depends on the stage of your company, the amount of data you have, your budget, and so on.
At an early stage, you can probably get by without a data warehouse, and connect a business intelligence (BI) tool directly to your production database (As we’ve mentioned in A simple setup for people just starting out).
However, if you are still not sure if a data warehouse is the right thing for your company, consider the below pointers:
First, do you need to analyse data from different sources?
At some point in your company’s life, you would need to combine data from different internal tools in order to make better, more informed business decisions.
For instance, you might want to track your most valuable customers on a weekly basis — which requires you to combine payment information from your credit card processor, financial information from your accounting system, and the activity data your customers generate within your product. This is a lot easier to do if your data is located in one central location than if you were to go to three separate places for analysis.
Second, do you need to separate your analytical data from your transactional data?
If you collect activity logs or other potentially useful pieces of information in your app, it’s probably not a good idea to store this data in your app’s database and have your analysts work on the production database directly. Instead, it’s a much better idea to purchase a data warehouse — one that's designed for complex querying — and transfer the analytical data there instead. That way, the performance of your app isn’t affected by your analytics work.
Third, is your original data source not suitable for querying?
For example, the vast majority of BI tools do not work well with NoSQL data stores like MongoDB. This means that applications that use MongoDB on the backend need their analytical data to be transferred to a data warehouse, in order for data analysts to work effectively with it.
Fourth, do you want to increase the performance of your analytical queries?
If your transactional data consists of hundreds of thousands of rows, it’s probably a good idea to create summary tables that aggregate that data into a more queryable form. Not doing so will cause queries to be incredibly slow — not to mention having them being an unnecessary burden on your database.
If you answered yes to any of the above questions, then chances are good that you should just get a data warehouse.
That said, in our opinion, it’s usually a good idea to just go get a data warehouse, as data warehouses are not expensive in the cloud era.
Here are some common data warehouses that you may pick from:
If you’re just getting started and don’t have a strong preferences, we suggest that you go with Google BigQuery for the following reasons:
(Note: we don’t have any affiliation with Google, and we don’t get paid to promote BigQuery).
However, if you have lots of data with a rapidly increasing volume, or complex or special use cases, you will need to carefully evaluate your options.
Below, we present a table of the most popular data warehouses. Our intention here is to give you a high-level understanding of the most common choices in the data warehouse space. This is by no means comprehensive, nor is it sufficient to help you make an informed decision.
We just want to run through a brief comparison:
At this point some of you might be asking:
“Hey isn’t a data warehouse just like a relational database that stores data for analytics? Can’t I just use something like MySQL, PostgreSQL, MSSQL or Oracle as my data warehouse?”
The short answer is: yes you can.
The long answer is: it depends.
First, it’s important to understand the difference between 2 kinds of database workloads: transactional workloads and analytical workloads.
Transactional workloads is the querying workload that serves normal business applications. When a visitor loads a product page in a web app, a query is sent to the DB to fetch this product, and return the result to the application for processing.
SELECT * FROM products WHERE id = 123
(the query above retrieves information for a single product with ID 123)
Here are several common attributes of transactional workloads:
Analytical workloads, on the other hand, refers to workload for analytical purposes, the kind of workload that this book talks about. When a data report is run, a query will be sent to DB to calculate the results, and then displayed to end users.
SELECT category_name, count(*) as num_products FROM products GROUP BY 1
(The above query scans the entire products table to count how many products are there in each category)
Analytical workloads, on the other hand, have the following attributes:
Below is a comparison tables between transactional vs analytical workload/databases.
Transactional workloads have many simple queries, whereas analytical workloads have few heavy queries.
Because of the drastic difference between the two workloads above, the underlying backend design of the database for the two workloads are very different. Transactional databases are optimized for fast, short queries with high concurrent volume, while analytical databases are optimized for long-running, resource-intensive queries.
What are differences in architecture? you ask. This will take a longer chapter to explain, but the gist of it is that analytical databases use the following techniques to guarantee superior performance:
As you can probably guess by now, MySQL, PostgreSQL, MSSQL, and Oracle databases are designed to handle transactional workloads, whereas data warehouses are designed to handle analytical workloads.
Like we’ve said earlier, yes you can, but it depends.
If you’re just starting out with small set of data and few analytical use cases, it’s perfectly fine to pick a normal SQL database as your data warehouse (most popular ones are ****MySQL, PostgreSQL, MSSQL or Oracle). If you’re relatively big with lots of data, you still can, but it will require proper tuning and configuring.
That said, with the advent of low-cost datawarehouse like BigQuery, Redshift above, we would recommend you go ahead with a data warehouse.
However, if you must choose a normal SQL-based database (for example your business only allows you to host it on-premise, within your own network) we recommend going with PostgreSQL as it has the most features supported for analytics.
We’ve written a detailed blog post discussing this topic here: Why you should use PostgreSQL over MySQL for analytics purpose.