Navigating The Business Intelligence Tool Space
In biology, a taxonomy is a scheme of classification for living things. It helps scientists in the field categorize different types of animals, such as ‘this fish has a swim bladder’ and ‘this fish has a really prominent dorsal fin’ and ‘oh my god this fish has a lantern sticking out of its head’.
Business intelligence tools can be roughly categorized in the same way that scientists categorize animals. Having a taxonomy for the business intelligence space is useful because it allows you to quickly place the tool within the first few minutes of a sales call (or within the first few minutes of browsing a vendor’s website). As we’ve mentioned in the previous section, the BI landscape can be incredibly confusing because tools from previous paradigms stick around for a long, long time. Having a categorization scheme in your head cuts through all that noise.
In this section, we’re going to build on the historical narrative of the previous section by giving you some of the most useful ways you may categorize such tools.
SQL vs Non-SQL
Some business intelligence tools demand knowledge of SQL. Others do not.
We’ve already talked about Tableau’s workflow in the previous section. Tableau assumes that data workers would have already transformed the data into a form that is suitable for analysis before handing it to a business user. The Tableau user then loads such data into Tableau, in order to generate visualizations and reports. Tableau assumes no knowledge of SQL in order to produce such visualizations.
Similarly, BI tools that operate on top of OLAP cubes tend to not use SQL. For example, Microsoft’s MDX language was developed specifically for operations on a data cube, and came to prominence with their SSAS suite. The language was then adopted by many other OLAP cube vendors in the market, and is today considered a solid alternative to SQL.
Other BI tools are unabashed about their SQL-orientation. Chartio, Redash, Mode, Cluvio and Holistics fall into this category. All of these tools come with powerful SQL query editors. All of them assume some familiarity with SQL.
There are variants on this approach, of course:
- Holistics treats direct SQL query creation as a secondary access path. We believe that data analysts should do the bulk of their business intelligence work with a data modeling layer.
- Chartio comes with a sophisticated Visual SQL mode, which makes SQL more accessible to non-technical users … but without giving up any of the power of the underlying language.
- Mode Analytics assumes that data extracted from your data warehouse should be used for both data science (that is, available for analysis within Jupyter Notebooks) as well as for business intelligence. It has an in-memory datastore that sits in between the tool and your data warehouse, in order to serve both purposes.
- Looker requires all data analysts to write in LookML, which is then translated into SQL for the underlying database.
The shift from non-SQL based analytics to SQL based analytics
Before we close this section, we should point out that there’s an existing shift towards SQL that has happened over the past five years.
This was not always obvious. Starting around 2010, there was a huge amount of hype around NoSQL datastores like MongoDB, Cassandra, and CouchDB. These datastores promised superior performance, but did not use SQL. There was also an earlier wave of excitement over big data technologies like Hadoop and Spark, the majority of which eschewed SQL for proprietary APIs.
Both trends have died in the years since. The vast majority of analytical datastores today have standardized around SQL. Even non-SQL datastores like the proprietary Spanner database from Google — and even Hadoop today! — have adopted SQL as a query language.
In an essay titled The Rise of SQL-Based Data Modeling and Data Ops, Holistics co-founder Thanh argued that the standardization around SQL happened for a few reasons:
- Proficient SQL users were able to utilize the power of any SQL cloud-based data warehouse to produce beautiful charts and dashboards … without the need to learn a new proprietary language or tool. This meant transferrable skills. It also meant that it was easier for companies to hire and train new analysts.
- SQL is text-based, and may be stored in a version control system. This made it trivially easy to manage.
- The dominance of SQL has led to an increased number of BI tools embracing SQL as a primary interface language.
We do not think this trend will reverse anytime soon, for the simple reason that standards only become more entrenched over time. The upshot here is that if you work in data, you should assume that SQL is the lingua franca of data analytics in the future. Pick tools accordingly.
Embedded Datastore vs External Datastore
Early business intelligence tools came with embedded datastores. This was as much for pragmatic reasons as anything else. For instance, the Cognos systems we discussed in the previous section came with embedded datastores, with the expectation that modeled data would be transferred from Cognos Transformer servers to Cognos PowerPlay cubes. This was the norm back then: if you wanted to buy a BI tool, you would be expected to buy the entire shebang — data warehouse, visualization suite and all.
Today, however, things are different. The majority of contemporary BI tools have opted to remain datastore agnostic, and to connect to as many data warehouse products as possible. This was a natural adaptation to the rise of the cost-effective, extremely powerful modern data warehouse. Tools like Metabase, Holistics, Chartio, and Redash belong to this category, as they come with connectors to pretty much every SQL datastore that exists on the market.
(This was also made easier because modern data warehouses have all standardized on SQL. See how awesome standardization is?)
Not every modern BI tool assumes an external datastore, of course. A notable exception to this is the original Sisense product that was launched in 2004. Sisense took an early columnar data warehouse and packaged a business intelligence tool around it. Their unique selling proposition was the fact that Sisense could run large datasets on commodity, off-the-shelf hardware — like a puny consumer laptop, for instance. This was remarkable for its time, especially given the dominant approach of purchasing expensive machines for data warehousing and cube materialization.
With that said, things changed in 2019 when Sisense acquired Periscope Data and rebranded it into the rather verbose ‘Sisense for Cloud Data Teams’. This effectively means that they, too, offer a product that connects to an external datastore today.
Sidebar: Massively Parallel Processing vs Relational Database?
While we are on the topic of connecting to external databases, let’s talk about what we mean when we say a 'modern data warehouse’. Holistics, for instance, connects to MySQL, SQL Server and Postgresql, in addition to cloud data warehouses like Redshift, Snowflake, and BigQuery. Many tools have an equally broad set of connectors. Are these databases all counted as ‘modern’?
The short answer is that no: they’re not. We’ve covered this in a previous section, in Understanding The Data Warehouse, but let’s go through this quickly, again. When we say ‘modern data warehouse’, we really mean data warehouses that have two properties:
- They are a column-oriented database (as opposed to a row-oriented database). Column-oriented databases are adapted for analytical workloads, and match OLAP cubes in performance.
- They have a massively parallel processing (MPP) architecture (as opposed to running on a single machine or a small cluster of machines). MPP architectures mean that such data warehouses arbitrarily scale up or down their computing resources depending on the complexity of your query. It also means that you might pay nothing upfront, but will have to pay for compute time on a variable cost (pay for what you use) basis.
This is not to say that you can’t use Holistics and other similar tools with RDBMSes like MySQL and Postgresql. In fact, we’ve seen many startups start out with a Postgres replica as a primary analytical database.
What it does mean, however, is that row-oriented databases like Postgres will eventually top out in performance once you reach large analytical data sizes. This performance limitation is what led to the ‘data warehouse and cube’ workflows of decades past, and what leads to the ‘cloud-based data warehouse’ workflows of today.
In-Memory vs In-Database
Another variant of this ‘embedded datastore’ vs ‘external datastore’ spectrum is the idea of ‘in memory’ vs ‘in database’ tools.
Tools like Holistics, Redash, Chartio, Metabase and Looker run SQL queries on top of a powerful database. The heavy lifting is done by the database itself; the connected BI tool merely grabs the results of generated queries and displays it to the user.
In contrast, a BI tool like Tableau or PowerBI assumes the analyst will take data out of the central data warehouse and run analysis on their own machines. In terms of workflow, this is similar to taking data out of a central system and then dumping it into Excel. The performance of the analytical tool is thus limited by the power of the tool itself, along with the computational resources of the analyst’s machine.
When you are evaluating BI tools, it helps to understand which process the tool assumes you would use. Does it leverage the power of the data warehouse? Or does it assume you’re going to be pulling data out and running it on an individual analyst’s machine?
Modeling vs Non-Modeling BI Tools
On that note, it’s clear that certain BI tools combine modeling with their core offering, while others do not.
The Cluvio, Redash and Mode Analytics tools we’ve mentioned don’t provide any modeling capabilities whatsoever. In practice, many contemporary data teams that we know either implement ELT techniques using data modeling layers like dbt or Dataform, or use a more traditional ETL approach using tools like Pentaho or Talend.
The key thing to take note of in a non-modeling BI approach is that either you modeled the data using a separate data modeling tool, or you have to hardcode the business logic directly into the report itself. If it’s the latter, you stand the risk of getting into business logic discrepancy, because now there are multiple places in your BI system that contain duplicates of the same logic.
Holistics and Looker are somewhat unique in this sense, in that they include a modeling layer alongside BI functionality. Because of that, your entire logic is centralized in the data modeling layer, thus greatly increase metric consistency and logic reusability across the organization.
An additional benefit of having a modeling layer baked in the same BI tool is maintaining context: you are able to trace the full lineage of report data back to its original form, because the tool plays a part in every transformation along the way.
Our Biases, Revisited
In Chapter 1 of this book, we spent a section talking about our biases regarding a good analytics stack. We wrote then that our biases were:
- We prefer ELT over ETL
- We prefer using a cloud data warehouse over an on-premise data warehouse. We also prefer MPP analytics databases over Hadoop-like systems.
- We believe Data Modeling is essential in an analytics setup and should not be overlooked.
- We think that SQL based analytics will win over non-SQL based analytics.
- We believe that analytics workflow/operations is more important than a singular focus on visualizations.
Given what we’ve covered over the past couple of chapters, and given what we’ve shown you about the landscape of BI tools in this section, you should now be able to place these biases against a larger context.
- We prefer ELT over ETL because we think that ELT gives you the power of more flexible data modeling practices. You may choose to skip the up-front modeling costs of the original Kimball paradigm, and only chose to do so when your reporting requirements call for it.
- We prefer using a cloud data warehouse because modern MPP data warehouses represent a fundamental shift in capabilities. The business model of a modern data warehouse also fits into the overall reorientation to cloud that we’re seeing in the broader marketplace. On this note, we are not alone: nearly every BI tool today assumes that a cloud data warehouse sits at the center of the stack. We have adapted accordingly; you should, too.
- We believe data modeling is essential because data modeling at a central layer enables organizational self-service … without the challenges of inconsistent metric definitions.
- We think that SQL based analytics will win over non-SQL based analytics, because the entire industry has standardized on SQL in the last five years.
- We believe that analytics workflow/operations are more important than a singular focus on visualizations. Much of the difficulty in business intelligence is the work needed to get data to a point of analysis. Beautiful visualizations alone will not determine the success of your department. It is more important that the business not bottleneck on its data team in order to get the insights they need.
So, let’s recap. When you’re in the market for a business intelligence tool, you may categorize the tools you see in the following ways:
|SQL vs Non-SQL
|Non-SQL: Tableau, PowerBI, Sisense
|SQL: Holistics, Looker, Mode, Redash, Metabase
|Embedded Datastore vs External Datastore
|Embedded: MicroStrategy, Tableau, PowerBI, Sisense
|External: Holistics, Looker, Metabase, Redash,
|In-memory vs In-database
|In-memory: Tableau, MicroStrategy, Sisense, PowerBI, etc.
|In-database: Holistics, Looker, Redash, Metabase, etc.
|Modeling vs non-modeling BI tools
|Non-modeling: Tableau, Mode, Redash
|Modeling: Qlik, PowerBI, Looker, Holistics
- SQL vs Non-SQL: Does the tool assume SQL as its primary query interface? Or does it export data out to a non-SQL data engine? Does it use cubes? The answer to this will tell you a lot about the paradigm the tool is from.
- Embedded Datastore vs External Datastore: Does the tool come with an embedded datastore, or does it expect to be connected to an external data warehouse? This, too, tells you to expect a monolithic tool, or one that is designed for the modern era of powerful data warehouses.
- In-Memory vs In-Database: Does the tool assume that data must be extracted out from a source system to an analyst’s machine? Or does it perform all operations within a database? This can have real implications on your analysts’ workflows.
- Assumes ETL vs Assumes ELT: Does the tool assume a particular transformation paradigm? Is it agnostic about the data it is fed?
- Modeling vs Non-Modeling BI Tools: Does the tool include a modeling component, or does it assume data will be modeled separately? Tools with modeling components take pains to give you full context; tools without do not.
There are other considerations when you’re shopping around for a BI tool, of course, but these are usually more obvious things that you already know about: for instance, you might have considerations around pricing (contract vs per-seat), licensing (open source vs proprietary), on-premise status (either on-premise only, or cloud only, or both) and so on. We’ve left them out because they are obvious.
We think that the list we’ve laid out above is the shortest possible taxonomy that gives you the most interesting information about the tools you are evaluating. We hope they give you a way to orient yourself, the next time you look out to the landscape of business intelligence.
Alright. We’re nearly done. We’ve taken you on a quick tour of business intelligence tool history, and we’ve given you a taxonomy with which to organize the field in your head. In our next, final, section, we shall explore the different kinds of data delivery you’ll experience over the arc of your company’s life.