
Read Data Research Papers

A hack for when you don't understand data jargon.

This is a short post, because I’m working on a longer piece that will come out next week.

Let’s say that you’re a new data analyst, or a new data engineer. You see some jargon that you don’t understand. Your instinct is to Google around.

Most times, this works just fine. You find some collection of articles that explain what you need, and then you walk away with a better understanding of the topic.

In some cases, however, like with OLAP cubes, you Google and open a bottle of wine and then two hours later you’re drunk and you still don’t know what an OLAP cube even is.

Like you may read an article that says ‘oh, an OLAP cube is just a table’.

And then you read The Rise and Fall of the OLAP Cube and it says that the OLAP Cube is a data structure.

Wikipedia doesn’t say anything about the implementation.

Kyligence says that OLAP necessarily involves cubes:

OLAP (short for OnLine Analytical Processing) is an approach designed to quickly answer analytics queries involving multiple dimensions. It does this by rolling up large, sometimes separate, datasets into a multidimensional database known as an OLAP Cube.

So what gives?

Data Papers are a Thing

The OLAP cube is an interesting example because it is a term that has meant different things at different times. At various points over the past 30 years it has meant:

  • A type of user interface for browsing data
  • A type of data structure that may be called an ‘aggregated table’ (or, more accurately, a view of data within a relational database; Gray, Bosworth, Layman & Pirahesh, 1995), sketched briefly after this list
  • A highly optimised data structure/data engine that exists as an alternative product to databases and data warehouses.
  • Or just … OLAP; sometimes people conflate OLAP and the OLAP cube, which, ugh.
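
Just to make that second sense concrete before we move on: in the Gray et al. formulation, a cube is essentially the union of GROUP BY aggregates over every subset of your dimensions (including the empty subset, which gives the grand total). Here's a minimal Python sketch of that idea, using made-up toy data rather than anything from the paper:

    # The Gray et al. sense of a cube: GROUP BY aggregates over every
    # subset of the dimensions, including the grand total ("ALL").
    from itertools import combinations
    from collections import defaultdict

    # Toy fact rows: (region, product, amount) -- purely illustrative data.
    rows = [
        ("EU", "books", 10),
        ("EU", "games", 20),
        ("US", "books", 30),
        ("US", "games", 40),
    ]
    dimensions = ("region", "product")

    def cube(rows, dimensions):
        """Return {grouping: {key: total}} for every subset of dimensions."""
        results = {}
        for r in range(len(dimensions) + 1):
            for group in combinations(range(len(dimensions)), r):
                totals = defaultdict(int)
                for row in rows:
                    key = tuple(row[i] for i in group)  # e.g. ("EU",) or ()
                    totals[key] += row[-1]              # sum the measure
                names = tuple(dimensions[i] for i in group) or ("ALL",)
                results[names] = dict(totals)
        return results

    for grouping, totals in cube(rows, dimensions).items():
        print(grouping, totals)
    # ('ALL',) {(): 100}
    # ('region',) {('EU',): 30, ('US',): 70}
    # ('product',) {('books',): 40, ('games',): 60}
    # ('region', 'product') {('EU', 'books'): 10, ('EU', 'games'): 20, ...}

Whether that pile of aggregates lives as a view inside a relational database or inside a specialised engine is precisely where the definitions start to diverge.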

But I don’t want to talk about this right now. We’ll do a deep dive of the literature next week.

This week, I want to make a simpler, more pragmatic point.

If you’re new to data, and you run into some jargon that a cursory Google search doesn’t explain, well … you can always go and read the academic papers.

In other words:

  • Don’t read more blog posts by vendors. (Including this one!)
  • Don’t read more tweets.
  • Don’t read more Wikipedia.

Instead, know that there is academic research on data technologies! Know that there are papers written on cube implementations and columnar databases and distributed systems that make up the bread and butter of our work!

Why do this?

The reason for doing this is simple. Reading data papers helps you cut through the bullshit. It takes advantage of two properties of our industry.

The first property is something I described last year, in OLAP != OLAP Cube:

The art of business intelligence advances when new ideas are published in research papers and when new ideas are implemented in commercial products. More often than not, the marketing materials of these products use terminology that's inherited from academia.

(Of course, sometimes the marketing materials of data products use terminology that’s mixed up — which is how we got to our current sorry state).

But it is important to know that this world of data technology research papers exists. In fact, it’s often the case that tech innovations are published in academia first, before they are built into real-world products.

To give you a taste of this, Michael Stonebraker’s C-Store: A Column-oriented DBMS was one of the first papers to take a serious look at column-oriented database architectures, laying the groundwork for our current crop of columnar data warehouses. A more recent example might be Materialize, one of the most exciting new data companies, which came out of work done at Microsoft Research.

A reverse example also exists. MapReduce was invented at Google, and then seeped out of the company and into research with a 2004 paper presented at OSDI. This triggered a brief flourishing of commercial MapReduce implementations, which in turn changed the nature of data work for professionals around the world.

Reading data papers is thus a way to trace the origins of many of the ideas we take for granted today. And you get to do it without any of the fluff.

This leads us to a second property that reading data papers takes advantage of — namely, that data papers are a prestige thing. Many papers — especially those presented at a technical conference — will be peer-reviewed before presentation. (As a general rule of thumb, conferences are more prestigious than journals in the field of Computer Science.) This means they are more likely to be clearly written, and whatever definitions they use will be precise. To be vague or imprecise is to invite reviewer scorn.

This in turn means that the incentive for data publications is always for the authors to produce clear, precise explanations for technical concepts. They are not writing for marketing purposes; heck, they aren’t even writing for developers. Data papers are written by authors for an audience of their peers — an audience of researchers that mostly compete on the originality and the rigor of their ideas.

In other words, when you read a paper that describes OLAP cube implementations and the authors say that ‘data cubes may be stored in a relational system or a proprietary MDDB (multi-dimensional database) system’, you know — with pretty good confidence — that an OLAP cube is not just ‘a table that is used to power a particular kind of analysis’, no matter what some blog post on the internet so confidently asserts.

How Is This Useful?

It’s useful because you can get to the bottom of things very quickly.

I don’t mean this as a flippant thing! When I was easing into data, I kept coming across conflicting descriptions of ‘OLAP cube’, and then I kept stumbling across terms like ‘massively parallel processing’ and ‘columnar datastore’ and ‘snowflake schema’. At some point, after reading the nth vendor explainer article, I thought to myself “surely there must be a distributed systems/database literature behind all of this?”

And indeed there was.

The protocol for finding these papers is very simple. Here’s what you do (there’s a small code sketch of the first two steps after the list):

  1. Google for the data term you want to find information on. Add keywords like ‘implementation’ or ‘paper’ or ‘conference’.
  2. Zoom in on the first result that is an academic paper.
  3. Download that paper, perhaps by Googling the exact title, or by using cough that website for finding research that cough we are all cough not supposed to use.
  4. Now the fun bit: read the lit review. The lit review is usually the second section of any technical paper, right after the abstract and introduction. It is where you are most likely to find a concise description of the technology; it will also give you the lay of the land of all the research that’s come before.
  5. If you don’t find anything useful, no matter — look for citations in the lit review that reference prior work.
  6. Repeat steps 3-5 by searching for those referenced papers, until you have your answer.
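
If you want to see what steps 1 and 2 look like in code, here's a small Python sketch that queries arXiv's public search API for a term. Treat it as a starting point only: arXiv is just one index I'm using for illustration, and plenty of classic database papers live in venues like SIGMOD or VLDB instead.

    # A rough, illustrative version of steps 1-2: search a paper index
    # for a data term. arXiv's Atom API is used here as one example index.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM = "{http://www.w3.org/2005/Atom}"

    def search_papers(term, max_results=5):
        """Return (title, link) pairs for papers matching a data term."""
        query = urllib.parse.urlencode({
            "search_query": f'all:"{term}"',  # quoted phrase search
            "start": 0,
            "max_results": max_results,
        })
        url = f"http://export.arxiv.org/api/query?{query}"
        with urllib.request.urlopen(url) as response:
            feed = ET.fromstring(response.read())
        return [
            (entry.find(f"{ATOM}title").text.strip(),
             entry.find(f"{ATOM}id").text)
            for entry in feed.findall(f"{ATOM}entry")
        ]

    for title, link in search_papers("OLAP data cube"):
        print(title, "->", link)

From there, steps 3 to 6 are still on you: download the paper, find the lit review, and follow the citations.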

From experience, this usually takes no more than three hops (some very gnarly ideas might take five, but that has literally happened only once). The good news: this technique scales to whatever data infrastructure question you might possibly have!

Give it a try — I think you’ll be surprised at how handy this trick turns out to be. As always, godspeed, and good luck.

Cedric Chin

Staff writer at Holistics. Enjoys Python, coffee, green tea, and cats. I'd love to talk to you about the future of business intelligence!
