The Data Lakehouse is a Thing

The data lakehouse is a thing, Superset is also a thing, and how to make dashboards using a product thinking approach.

February 03, 2021 · 5 min read · Huy Nguyen

We spent the last three weeks publishing a longer newsletter, which we're now reposting to our blog. If you'd like more of this, sign up here. Enjoy!

Inner Join is Holistics's weekly business intelligence newsletter. This week: so it seems that the data lakehouse is a thing, Superset is also a thing, and how to make dashboards using a product thinking approach.

Insights From Holistics

It seems like Databricks has been everywhere lately.

In the past two weeks alone, I've listened to two podcasts with Ali Ghodsi, the CEO and founder of Databricks, read three articles about the company, and then started to dig into the whole 'data lakehouse' pattern that he's been shouting from the rooftops. If this is marketing, it sure as hell is working on me.

Here's the quick summary: yesterday, Databricks announced it raised $1 billion at a $28 billion valuation. They took money from all three of the major cloud providers (which, what?!), Salesforce Ventures, and then a host of public investors like T. Rowe Price and Tiger Global and BlackRock and I'm sure a couple of pension funds with some of your money (assuming that you're reading this and living in the US and you have a pension, that is).

But this is a newsletter about the practice of business intelligence, not necessarily about the vendors in business intelligence, so why am I paying attention?

Well, a couple of weeks ago, Holistics staff writer Cedric wrote, on hybrid data infrastructures:

Our take is that it’s too early to tell what the ideal ‘hybrid’ data infrastructure should look like — that is, one that can handle both AI and analytics. We know that there is an emerging class of tools that have started to attack this pain point. We know that these tool companies are more often than not venture-backed, which means a clash of approaches is inevitable — both from the vendors and from the VC firms that back them. (Expect to see more such ‘reports’ and ‘debates’ in the near future!)

We also know that we have to keep an eye on the development of this topic, given our position in the industry: if a particular infrastructure looks like it’s going to win, then we’re going to have to evolve our product to take that into account.

This is us paying attention. You see, Databricks has a big take on the whole hybrid data infrastructure shebang: they call it Delta Lake, and their pitch is "we make your data lake as easy to use as a data warehouse, and, hell, you can start acting as if you have a warehouse inside your lake."

This is a compelling pitch.

Databricks is calling this a 'data lakehouse', and the Delta Lake paper is worth reading — if you're into that sort of thing. Delta Lake itself is fairly old (it was launched in April 2019, and open sourced at the same time) — but the funding announcement ties together a few of their product moves in a way that is deliberate and quite possibly designed to make people sit up and take notice.

If so, it's working. We're paying attention.

If you want to dig further, Ali Ghodsi has a recent podcast on a16z with Martin Casado (who pushes back a little on his narratives), and another one with Patrick O'Shaughnessy (who doesn't).

Jamin Ball also has a short take on Delta Lake vs Snowflake (who, to their credit, aren't taking this lying down, and are also rapidly moving from pure warehousing into more data sciencey use cases).

If you're a data practitioner, it may be worth keeping an eye on both the Databricks and Snowflake cloud platforms. Both their approaches are definitely going to be a thing going forward.

Links From Elsewhere

Apache Superset 1.0 released — Apache Superset is also a thing! Maxime Beauchemin wrote this announcement on the 21st of January, and then Dropbox's Bogdan Kyryliuk published Why We Chose Apache Superset as Our Data Exploration Platform. I have nothing but respect for Beauchemin's work; both are worth a read.

If All You Have is a Database, Everything Looks Like a Nail — Pat Helland, an old hand at database systems, has a fantastic little rant about all the ways he's seen businesses abuse databases over the years. The piece begins with:

Back in 1978 when I took my first job programming, it was to build a database system. At age 22, I didn’t know what that was but it didn’t matter. I needed the job. That startup company eventually became a shutdown company and in 1982, I went to work at Tandem Computers working on the infrastructure that a few years later became Tandem’s NonStop SQL.

I mean, if you're into database history, this is great stuff.

How to Make Dashboards Using a Product Thinking Approach — Shopify's Lin Taylor has a collection of wonderful, pragmatic tips on how to create good dashboards for business stakeholders. As it turns out, the principles are similar to those for designing user interfaces. Because of course they are.

My favourite visualization:

You can tell Taylor practices what she preaches.

That's it for this week! If you enjoyed this newsletter, I'd be very appreciative if you forwarded it to a friend. And if you have any feedback for me, hit the reply button and shoot me an email — I'm always happy to hear from readers.

As always, I wish you a good week ahead,

Warmly,

Huy,

Co-founder, Holistics.

Huy Nguyen

Data Engineer turned Product; writes SQL for a living.