How We Combine Data Sources at Holistics

Good data teams combine data from different sources in order to do their jobs well. Here's what we've learnt while putting this into practice at Holistics.

Holistics is a modern business. Like many modern businesses, we use multiple tools, each of which generates its own mountain of data.

This inevitably led us to the same problems that many other businesses have faced:

  • How do we combine data from those different tools?
  • How do we map out a single customer journey (say, from first showing interest, to purchasing, to becoming a returning customer)?

There are many solutions out there promising to "bring your data together in one place". However, is it really that simple?

We found out the hard way that it is not.

What kind of data do we have?

As a B2B startup, we did not actually have too many data sources. Initially, this is what we owned:

What kind of headaches do we have?

The first thing to think about before doing any stitching is to ensure data quality and availability. Here, we encountered several problems:

Sales and customer support data are subject to human error

Some fields on Pipedrive are automatically populated, but there is still a lot of manual input. This process is error-prone, and sales records often suffer from missing fields. The same issue shows up in our customer support data on Zendesk: we use custom fields to categorize tickets, but those fields were created only recently, so older tickets do not have them.

Data from Google Analytics is aggregated, not raw

For web traffic, initially we were stuck with Google Analytics's reporting system whenever we wanted to do complicated queries or visitor visualization. GA's API returned only aggregated data, so even when we used a connector to pull that data into our own storage, there was not much we could do with it.

Back-end data cannot answer everything

Back-end data is good for high-level reporting, but not good enough for analyzing user behavior. If we wanted to know customers' behavior down to each click and page view, we had to add tracking logic directly to the code. This could slow the app down, and it made it harder for the product team to add new measurements.

There is no unified customer identification across the apps

This was the most obvious problem we faced. Logically, the customer account data from the platform should match with account data in Pipedrive, but for various technical reasons, we did not have a common key to match these two.

Matching between traffic data and sales data is also not possible, because with Google Analytics we only know "how many people did what", not "who did what".

What we did to stitch data from these sources

Define a procedure to help match data from different sources

To match sales accounts with their back-end customer IDs in our platform, we asked our growth team for help. (At Holistics, the growth team is responsible for sales and marketing.) We agreed that whenever a prospect activated their account on the platform, a growth team member would enter the back-end customer ID into a custom field in Pipedrive.

Periodically we would also run queries to check for data inaccuracies and discrepancies between the two data sources. For example: active customers without a backend customer ID, or churned customers with a net MRR larger than 0.
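As a rough illustration of what such a check might look like, here is a minimal Python sketch. The record shape, field names, and status labels are hypothetical and are not our actual Pipedrive schema. The third check (an ID that does not exist in the platform data) is an assumed extra that catches typos in the manually entered field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SalesAccount:
    # Hypothetical shape of an account record exported from the CRM.
    name: str
    status: str                                  # "active" or "churned" (assumed labels)
    net_mrr: float                               # net monthly recurring revenue
    backend_customer_id: Optional[str] = None    # the manually entered custom field


def find_discrepancies(accounts, backend_ids):
    """Return (account name, problem) pairs for records that need manual review."""
    problems = []
    for acc in accounts:
        if acc.status == "active" and not acc.backend_customer_id:
            problems.append((acc.name, "active customer without backend customer ID"))
        elif acc.backend_customer_id and acc.backend_customer_id not in backend_ids:
            # Assumed extra check: the ID was entered but matches nothing on the platform.
            problems.append((acc.name, "backend customer ID not found in platform data"))
        if acc.status == "churned" and acc.net_mrr > 0:
            problems.append((acc.name, "churned customer with net MRR above 0"))
    return problems


# Made-up sample data for illustration only.
accounts = [
    SalesAccount("Acme", "active", 500.0, "c_101"),
    SalesAccount("Globex", "active", 300.0),              # ID never entered
    SalesAccount("Initech", "churned", 120.0, "c_103"),   # MRR should be 0
    SalesAccount("Umbrella", "active", 80.0, "c_999"),    # typo in the ID
]
backend_ids = {"c_101", "c_103"}

for name, problem in find_discrepancies(accounts, backend_ids):
    print(f"{name}: {problem}")
```

In this sample, Globex, Initech, and Umbrella are flagged and Acme passes. In practice the same logic would more likely live in a scheduled SQL query against the warehouse; the point is only that each rule is a simple, explicit predicate over the joined records.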

Use an open-source event tracking solution

If you stick with Google Analytics but want to match your traffic data with your sales data, you have to pay for Google Analytics 360 (which costs about $150,000 a year). This is something we could not afford. Besides, why pay so much when there are equally awesome, cheaper alternatives?

After doing some research, we ended up deploying Snowplow to send hit-level data to our own BigQuery storage instead of to Google Analytics. With Snowplow, we solved several problems at once: we finally had access to raw traffic data for customer journey mapping, and we could now map visitors to their sales accounts for finer customer segmentation.
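To make the "who did what" idea concrete, here is a minimal Python sketch of stitching anonymous page views to a known customer once that visitor logs in. The field names mirror Snowplow's `domain_userid` (first-party cookie ID) and `user_id` (set after identification), but the rows, the `ts` and `page` keys, and the helper functions are made up for illustration and are not our production model.

```python
from collections import defaultdict

# Hypothetical hit-level events; one row per page view.
events = [
    {"domain_userid": "d1", "user_id": None,    "ts": "2019-01-01T10:00:00", "page": "/blog/post"},
    {"domain_userid": "d1", "user_id": None,    "ts": "2019-01-02T09:00:00", "page": "/pricing"},
    {"domain_userid": "d1", "user_id": "c_101", "ts": "2019-01-02T09:05:00", "page": "/app/dashboard"},
    {"domain_userid": "d2", "user_id": None,    "ts": "2019-01-03T12:00:00", "page": "/blog/post"},
]


def build_identity_map(events):
    """Map each cookie ID to the customer ID seen on any of its events."""
    identity = {}
    for e in events:
        if e["user_id"] is not None:
            identity[e["domain_userid"]] = e["user_id"]
    return identity


def journeys_by_customer(events):
    """Group page views per identified customer, ordered by timestamp."""
    identity = build_identity_map(events)
    journeys = defaultdict(list)
    # ISO 8601 timestamps in the same format sort correctly as strings.
    for e in sorted(events, key=lambda e: e["ts"]):
        customer = identity.get(e["domain_userid"])
        if customer is not None:      # visitors who never log in stay anonymous
            journeys[customer].append(e["page"])
    return dict(journeys)


print(journeys_by_customer(events))
```

Here the two anonymous visits from cookie `d1` are attributed to customer `c_101` after the fact, while `d2` stays unmatched. This kind of backfill is only possible with hit-level data; aggregated reports have already discarded the per-visitor rows it depends on.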

We also use Snowplow and Google Tag Manager to track customer usage in our web app, so that the product team does not have to rely too much on engineering whenever they want to add a new tracker.

Using Holistics to model user behavioral data:

What we learned from this project

The problems that we faced at Holistics are merely a fraction of all the problems you can run into when combining disconnected data sources. Our solution is by no means universal, though it may apply to small, young startups. We won't claim that it is future-proof either: personnel changes, new tools in our workflow, and new analytical needs may make our current approach obsolete.

However, there are a few takeaways from this experience that we believe will continue to guide us in the future:

  • Collaboration between departments is the key. As a member of the data team, you need to keep communicating with representatives of other departments every step of the way: from problem statement, to solution proposal, to eventual implementation. And of course, you should always remain open to what other departments are willing to do in any analytics project.
  • Compromises are inevitable. For instance, we wanted to automatically pass customer IDs from our back-end into Pipedrive to reduce manual load, but doing so turned out to take too much engineering time for too little payoff. That is why we decided to live with manual input and periodic data cleaning.
  • New tools will give you new problems. As the saying goes, with great power comes great responsibility. When we deployed a powerful and flexible tool like Snowplow, a whole new set of problems that we had never considered turned up alongside it. We had to define a new procedure for teams to request tracking, and create a management system for the tracking itself so that we know what we already have and what we should measure for future features.

Overall, we are incredibly pleased with our current analytics setup, and find that we are now much more productive when digging for insights about Holistics's customer behavior within Holistics itself. We hope these experiences will prove useful to you.