Modeling Example: A Real-world Use Case

In this section we are going to walk through a real-world data modeling effort that we carried out in Holistics, so that you can gain a better understanding of the ideas we’ve presented in the previous two segments. The purpose of this piece is two-fold:

  1. We want to give you a taste of what it’s like to model data using a data modeling layer tool. Naturally, we will be using Holistics, since that is what we use internally to measure our business. But it is the general approach that matters here: the ideas apply regardless of whether you’re using Holistics or some other data modeling layer tool like dbt or Looker.
  2. We want to show you how we think about combining the Kimball-style, heavy, dimensional data modeling approach with the more ‘just-in-time’, lightweight, ‘model how you like’ approach. This example will show how we’ve evolved our approach to modeling a particular section of our data over the period of a few months.

By the end of this segment, we hope to convince you that using a data modeling layer tool along with the ELT approach is the right way to go.

The Problem

In the middle of 2019, we began to adopt Snowplow as an alternative to Google Analytics for all our front-facing marketing sites. Snowplow is an open-source event analytics tool. It allows us to define and record events for any number of things on https://www.holistics.io/ — if you go to the website and click a link, watch a video, or navigate to our blog, a Javascript event is created and sent to the Snowplow event collector that runs on our servers.

Our Snowplow installation captures and delivers such event data to BigQuery. And our internal Holistics instance sits on top of this BigQuery data warehouse.

Snowplow raw event data is fairly complex. The first step we took was to model the raw event data, like so:

Snowplow data

Note that there are over 130 columns in the underlying table, and about 221 fields in the data model. This is a large fact table by most measures.

Our data team quickly realized two things: first, this data was going to be referenced a lot by the marketing team, as they checked the performance of our various blog posts and landing pages. Second, the cost of processing gigabytes of raw event data was going to be significant, given that these reports would be accessed so regularly.

Within a few days of setting up Snowplow, we decided to create a new data model on which to run the majority of our reports. This data model would aggregate raw event data to the grain of the pageview, which is the level at which most of our marketers operated.

Notice a few things that went into this decision. In the previous section on Kimball data modeling we argued that it wasn’t strictly necessary to write aggregation tables when working with large fact tables on modern data warehouses. Our work with the Snowplow data happened within BigQuery — an extremely powerful MPP data warehouse — so it was actually pretty doable to just run aggregations off the raw event data.

But our reasoning to write a new data model was as follows:

  • The series of dashboards to be built on top of the Snowplow data would be used very regularly. We knew this because various members of the sales & marketing teams were already asking questions within the first week of having Snowplow installed. This meant that the time cost of setting up the model would be justified over the course of doing business.
  • We took into account the cost of running aggregation queries across hundreds of thousands of rows every time a marketer opened a Snowplow-related report. If this data wasn’t so regularly accessed, we might have let it be (our reasoning: don’t waste employee time to reduce BigQuery compute costs if a report isn’t going to be used much!), but we thought the widespread use of these reports justified the additional work.

Notice how we made the decision to model data by considering multiple factors: the time costs to create a new model, the expected usage rate, and our infrastructure costs. This is very different from a pure Kimball approach, where every data warehousing project necessarily demanded a data modeling effort up-front.

Creating The Pageview Model

So how did we do this? In Holistics, we created this pageview-level data model by writing some custom SQL (don’t read the whole thing, just skim — this is for illustration purposes only):

with 
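-- page_view_stats: one row per page view, with start/stop timestamps, time spent on the page, and max scroll depth (internal visitors excluded)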
page_view_stats as (
  select
    {{#e.domain_userid}}
    , {{#e.domain_sessionidx}}
    , {{#e.session_id}} as session_id
    , wp.id as page_view_id
    , {{#e.app_id}}

    , min({{#e.derived_tstamp}}) as pv_start_at
    , max({{#e.derived_tstamp}}) as pv_stop_at
    , timestamp_diff(max({{#e.derived_tstamp}}), min({{#e.derived_tstamp}}), second) as time_spent_secs
    , timestamp_diff(max({{#e.derived_tstamp}}), min({{#e.derived_tstamp}}), second) / 60 as time_spent_mins

    , max((case when {{#e.pp_yoffset_max}} > {{#e.doc_height}} then {{#e.doc_height}} 
          else {{#e.pp_yoffset_max}} end) / {{#e.doc_height}}) as max_scroll_depth_pct

  from {{#snowplow_holistics e}}
  left join unnest({{#e.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0}}) as wp
  left join {{#internal_visitors iv}} on {{#e.domain_userid}} = {{#iv.domain_userid}}
  where {{#e.app_id}} in ('landing', 'docs', 'blog')
    and cast({{#e.derived_tstamp}} as date) >= '2019-07-30'
    and {{#iv.domain_userid}} is null
  group by 1, 2, 3, 4, 5
)

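-- session_stats: roll page views up to the session level to get session start time and total time spent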
, session_stats as (
  select
    p.domain_userid
    , p.session_id as session_id
    , min(p.pv_start_at) as session_started_at
    , cast(min(p.pv_start_at) as date) as session_started_date
    , sum(time_spent_mins) as session_time_mins
    , round(sum(time_spent_mins) / 60) as session_time_hours
  from page_view_stats p
  group by 1, 2
)

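-- visitor_stats: each visitor's first visit date to the landing/docs pages and to the blog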
, visitor_stats as (
  select
    p.domain_userid
    , cast(min(case when app_id in ('landing', 'docs') then p.pv_start_at else null end) as date) as first_visited_landing_date
    , cast(min(case when app_id = 'blog' then p.pv_start_at else null end) as date) as first_visited_blog_date
  from page_view_stats p
  group by 1
)

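-- Final select: one row per page view, enriched with session, visitor, referrer and marketing attributes (bots, internal visitors and test traffic removed)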
select
  {{#e.app_id}}
  , {{#e.domain_userid}}
  , vs.first_visited_landing_date
  , vs.first_visited_blog_date

  , {{#e.domain_sessionidx}}
  , {{#e.session_id}} as session_id
  , ss.session_started_at
  , ss.session_started_date

  , {{#e.mkt_source_grouping}} as mkt_source_grouping
  , {{#e.utm_source_grouping}} as utm_source_grouping
  , {{#e.utm_referrer_grouping}} as utm_referrer_grouping
  , {{#e.mkt_medium}}
  , {{#e.mkt_campaign}}
  , {{#e.os_timezone}}
  , {{#e.geo_country}}

  , case when {{#e.refr_urlhost}} = 'www.holistics.io' and {{#e.utm_referrer}} is not null 
      then {{#e.utm_referrer_host}} else {{#e.refr_urlhost}} end as refr_urlhost
  , case when {{#e.refr_urlhost}} = 'www.holistics.io' and {{#e.utm_referrer}} is not null 
      then {{#e.utm_referrer_path}} else {{#e.refr_urlpath}} end as refr_urlpath
  , case when {{#e.refr_urlhost}} = 'www.holistics.io' and {{#e.utm_referrer}} is not null 
      then coalesce({{#e.utm_referrer}}, '/') 
      else concat({{#e.refr_urlhost}}, coalesce({{#e.refr_urlpath}}, '/')) end as referrer

  , {{#e.referrer_grouping}} as referrer_grouping 

  , {{#e.page_urlhost}}
  , {{#e.page_urlpath}}
  , concat({{#e.page_urlhost}}, coalesce({{#e.page_urlpath}}, '/')) as page
  , {{#e.page_grouping}} as page_grouping

  , wp.id as page_view_id
  , pvs.pv_start_at
  , pvs.pv_stop_at

  , coalesce(pvs.max_scroll_depth_pct, 0) as max_scroll_depth_pct
  , pvs.time_spent_secs as time_on_page_secs
  , pvs.time_spent_mins as time_on_page_mins

  -- Actions aggregation
  , {{#e.count_click_vid_how_holistics_works}} as count_click_video_how_holistics_works 
  , {{#e.count_submit_demo_email}} as count_submit_demo_email
  , {{#e.count_book_demo}} as count_book_demo
  , {{#e.count_submit_trial_email}} as count_submit_trial_email
  , {{#e.count_request_trial}} as count_request_trial

from {{#snowplow_holistics e }}
, unnest( {{#e.contexts_com_snowplowanalytics_snowplow_ua_parser_context_1_0_0}}) as ua
left join unnest( {{#e.contexts_com_snowplowanalytics_snowplow_web_page_1_0_0}}) as wp

left join session_stats ss on {{#e.session_id}} = ss.session_id
left join page_view_stats pvs on {{#e.session_id}} = pvs.session_id and wp.id = pvs.page_view_id
left join visitor_stats vs on {{#e.domain_userid}} = vs.domain_userid
left join {{#internal_visitors iv}} on {{#e.domain_userid}} = {{#iv.domain_userid}}

where 
  {{#e.app_id}} in ('landing', 'docs', 'blog')
  and {{#e.event}} != 'page_ping'
  and cast({{#e.derived_tstamp}} as date) >= '2019-07-30'
  and {{#e.is_test}} = false
  and {{#iv.domain_userid}} is null

  -- Remove bots
  and not regexp_contains(ua.useragent_family,'(?i)(bot|crawl|slurp|spider|archiv|spinn|sniff|seo|audit|survey|pingdom|worm|capture|(browser|screen)shots|analyz|index|thumb|check|YandexBot|Twitterbot|a_archiver|facebookexternalhit|Bingbot|Googlebot|Baiduspider|360(Spider|User-agent))')
  and  coalesce(regexp_contains( {{#e.refr_urlhost}}, 'seo'), false ) is false
  and  {{#e.page_urlhost}} != 'gtm-msr.appspot.com' 
  and ({{#e.refr_urlhost}} != 'gtm-msr.appspot.com' or {{#e.refr_urlhost}} is null)
group by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29

Within the Holistics user interface, the above query generated a model that looked like this:

Generated model

We then persisted this model to a new table within BigQuery. The persistence settings below mean that the SQL query you saw above would be rerun by the Holistics data modeling layer once every two hours. We could modify this refresh schedule as we saw fit.

Persisted

We could also sanity-check the data lineage of our new model by peeking at the dependency graph generated by Holistics:

Dependency graph

In this particular case, our pageview-level data model was generated from our Snowplow event fact table in BigQuery, along with a dbdocs_orgs dimension table stored in PostgreSQL. (dbdocs is a separate product in our company, but our landing pages and marketing materials on Holistics occasionally link out to dbdocs.io — this meant it was important for the same people to be able to check marketing performance for that asset as well.)

Our reports were then switched over to this data model, instead of the raw event fact table that they used earlier. The total time taken for this effort: half a week.

Evolving The Model To A Different Grain

A few months later, members of our marketing team began to ask about funnel fall-off rates. We were running a couple of new campaigns across a handful of new landing pages, and the product side of the business began toying with the idea of freemium pricing for certain early-stage startup customers.

However, running such marketing efforts meant watching the bounce rates (or fall-off rates) of our various funnels very carefully. As it turned out, this information was difficult to query using the pageview model. Our data analysts found that they were writing rather convoluted queries because they had to express all sorts of complicated business logic within the queries themselves. For instance, a ‘bounced session’ at Holistics is defined as a session in which:

  • there is only one page view, and the visitor has no activity in any other session, or
  • the visitor did not scroll down the page, or
  • the visitor scrolled down but spent less than 20 seconds on the page.

Including complex business logic in one’s SQL queries was a ‘code smell’ if there ever was one.

The solution our data team settled on was to create a new data model — one that operated at a higher grain than the pageview model. We wanted to capture ‘sessions’, and build reports on top of this session data.

So, we created a new model that we named session_aggr. This data model was derived from the pageview model we had created earlier. The lineage graph thus looked like this:

Lineage graph

And the SQL used to generate this new data model from the pageview model was as follows (again, skim it, but don’t worry if you don’t understand):

#standardsql
with 
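-- session_ts: start and latest event timestamps for each session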
session_ts as (
  select
    {{#s.domain_userid}}
    , {{#s.domain_sessionidx}}
    , {{#s.session_id}} as session_id
    , min( {{#s.session_started_at}}) as session_started_at
    , max( {{#s.pv_stop_at}}) as session_latest_ts
  from {{#session_pages_aggr s}}
  group by 1, 2, 3
)

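-- first_page: marketing source, referrer and page attributes of the first page view in each session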
, first_page as (
  select * from (
    select
      {{#p1.domain_userid}}
      , {{#p1.domain_sessionidx}}
      , {{#p1.session_id}}
      , {{#p1.mkt_source_grouping}} as mkt_source
      , {{#p1.mkt_medium}}
      , {{#p1.mkt_campaign}}

      , {{#p1.page_urlhost}} as first_page_host
      , {{#p1.page}} as first_page
      , {{#p1.page_grouping}} as first_page_grouping

      , {{#p1.refr_urlhost}} as first_referrer_host
      , {{#p1.referrer}} as first_referrer
      , {{#p1.referrer_grouping}} as first_referrer_grouping

      , row_number() over (partition by {{#p1.domain_userid}}, {{#p1.domain_sessionidx}} order by {{#p1.pv_start_at}} asc) as page_idx
    from {{#session_pages_aggr p1}}
  ) where page_idx = 1
)

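-- Final select: one row per session, combining first-page attribution, the list of visited pages, and demo/trial action flags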
select
  {{#p.domain_userid}}
  , {{#p.domain_sessionidx}}
  , {{#p.session_id}}
  , st.session_started_at
  , st.session_latest_ts
  , cast(st.session_started_at as date) as session_started_date
  , fp.mkt_source 
  , fp.mkt_medium 
  , fp.mkt_campaign  

  , first_referrer_host
  , first_referrer
  , first_referrer_grouping

  , first_page_host
  , first_page
  , first_page_grouping

  , string_agg( concat({{#p.page_urlhost}}, {{#p.page_urlpath}}) ) as visited_pages

  , {{#p.count_pageviews}} as count_pageviews
  , {{#p.count_unique_pages_viewed}} as count_unique_pages_viewed
  , {{#p.count_pages_without_scroll}} as count_pages_without_scroll
  , {{#p.sum_time_on_page_secs}} as total_session_time_secs
  , {{#p.sum_time_on_page_mins}} as total_session_time_mins

  -- demo  
  , sum({{#p.count_submit_demo_email}}) > 0 as submitted_demo_email
  , sum( {{#p.count_book_demo}}) > 0 as booked_demo

  -- trial
  , sum({{#p.count_submit_trial_email}}) > 0 as submitted_trial_email
  , sum({{#p.count_request_trial}}) > 0 as requested_trial


from {{#session_pages_aggr p}}
left join session_ts st on {{#p.session_id}} = st.session_id
left join first_page fp on {{#p.session_id}} = fp.session_id
group by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16

And in the Holistics user interface, this is what that query looked like (note how certain fields were annotated by our data analysts; this made it easier for marketing staff to navigate in our self-service UI later):

Annotation

This session model is regenerated from the pageview model once every 3 hours, and persisted into BigQuery with the table name persisted_models.persisted_session_aggr. The Holistics data modeling layer would take care to regenerate the pageview model first, before regenerating the session model.

With this new session data model, it became relatively easy for our analysts to create new reports for the marketing team. Their queries were now very simple SELECT statements from the session data model, and contained no business logic. This made it a lot easier to create and maintain new marketing dashboards, especially since all the hard work had already been captured at the data modeling layer.
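To give you a sense of what ‘very simple’ means here, a typical report query against the persisted session table might look something like the sketch below (the field names come from the session model above; the exact queries our analysts wrote may have differed):

-- Illustrative report query: demo and trial conversions by campaign.
-- No business logic lives here; everything was computed in the session model.
select
  session_started_date
  , mkt_campaign
  , count(*) as sessions
  , countif(booked_demo) as demo_bookings
  , countif(requested_trial) as trial_requests
from persisted_models.persisted_session_aggr
group by 1, 2
order by 1, 2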

Exposing Self-Service Analytics to Business Users

It’s worth taking a quick look at what all of this effort leads to.

In The Data Warehouse Toolkit, Ralph Kimball championed data modeling as a way to help business users navigate data within the data warehouse. In this, he hit on one of the lasting benefits of data modeling.

Data modeling in Kimball’s day really was necessary to help business users make sense of data. When presented with a BI tool, non-technical users could orient themselves using the labels on the dimensional tables.

Data modeling serves a similar purpose for us. We don’t think it’s very smart to have data analysts spend all their time writing new reports for business users. It’s better if their work becomes reusable components that business users can use to help themselves.

In Holistics, the primary way this happens is through Holistics Datasets — a term we use to describe self-service data marts. After model creation, an analyst is able to package a set of data models into a (wait for it) dataset. This dataset is then made available to business users. The user interface for a dataset looks like this:

Datasets

In the leftmost column are the fields of the models collected within the dataset. These fields are usually self-describing, though analysts take care to add textual descriptions where the field names are ambiguous.

In Holistics, we train business users to help themselves to data. This interface is key to that experience. Our business users drag whatever field they are interested in exploring to the second column, and then generate results or visualizations in the third column.

This allows us to serve measurements throughout the entire organization, despite having a tiny data team.

Takeaways

What are some lessons we may take away from this case study? Here are a few that we want to highlight.

Let Usage Determine Modeling, Not The Reverse

Notice how sparingly we’ve used Kimball-style dimensional data modeling throughout the example above. We have only one dimension table, which is related to dbdocs (the aforementioned dbdocs_orgs table). As of right now, most dimensional data is stored within the Snowplow fact table itself.

Is this ideal? No. But is it enough for the reports that our marketing team uses? Yes, it is.

The truth is that if our current data model poses problems for us down the line, we can always spend a day or two splitting out the dimensions into a bunch of new dimension tables according to Kimball’s methodology. Because all of our raw analytical data is captured in the same data warehouse, we need not fear losing the data required for future changes. We can simply redo our models within Holistics’s data modeling layer, set a persistence setting, and then let the data warehouse do the heavy lifting for us.
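As a rough illustration of how small that job would be, a page dimension could be carved out of the pageview data with a model along these lines (a hypothetical sketch only; the persisted table name and the surrogate key scheme below are illustrative, not something we have actually built):

-- Hypothetical sketch: splitting a page dimension out of the existing pageview data.
-- The table name stands in for wherever the pageview model is persisted.
select
  farm_fingerprint(concat(page_urlhost, coalesce(page_urlpath, '/'))) as page_key
  , page_urlhost
  , page_urlpath
  , page_grouping
from persisted_models.persisted_pageview_aggr
group by 1, 2, 3, 4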

Model Just Enough, But No More

Notice how we modeled pageviews from our event data first, and sessions only later, when our marketing colleagues asked for them. We could have speculatively modeled sessions early on in our Snowplow adoption, but we didn’t. We chose to guard our data team’s time judiciously.

When you are in a fast-moving startup, it is better to do just enough to deliver business insights today, as opposed to crafting beautiful data models for tomorrow. When it came time to create the session data model, it took an analyst only two days to come up with the SQL and to materialize it within Holistics. It then took only another day or so to attach reports to this new data model.

Use such speed to your advantage. Model only what you must.

Embed Business Logic in Data Models, Not Queries

Most of the data modeling layer tools out there encourage you to pre-calculate business metrics within your data model. This allows you to keep your queries simple. It also prevents human errors from occurring.

Let’s take the example of our ‘bounced session’ definition, above. If we had not included it in the sessions model, this would mean that all the data analysts in our company would need to remember exactly how a bounced session is defined by our marketing people. They would write their queries according to this definition, but would risk making subtle errors that might not be caught for months.

Having our bounced sessions defined in our sessions data model meant that our reports could simply SELECT off our model. It also meant that if our marketing team changed their definition of a bounced session, we would only have to update that definition in a single place.
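Here is a rough sketch of what that looks like in practice. The is_bounced expression below is a simplified stand-in for our actual definition, and in our setup it lives inside the session model rather than in a query; once it is there, a bounce-rate report reduces to a simple aggregate:

-- The sessions CTE below stands in for the session model; in practice the is_bounced
-- expression is defined once, inside the model itself. Names and thresholds are illustrative.
with sessions as (
  select
    s.*
    , (
        s.count_pageviews = 1                                 -- a single page view
        or s.count_pages_without_scroll = s.count_pageviews   -- never scrolled down
        or s.total_session_time_secs < 20                     -- scrolled, but under 20 seconds
      ) as is_bounced
  from persisted_models.persisted_session_aggr s
)

select
  session_started_date
  , countif(is_bounced) / count(*) as bounce_rate
from sessions
group by 1
order by 1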

The Goal of Modeling Is Self-Service

Like Kimball, we believe that the end goal of modeling is self-service. Self-service is important because it means that your organization is no longer bottlenecked at the data team.

At Holistics, we’ve built our software to shorten the gap between modeling and delivery. But it’s important to note that these ideas aren’t limited to our software alone. A similar approach using slightly different tools works just as well. For instance, Looker is known for its self-service capabilities. There, the approach is somewhat similar: data analysts model their raw tables, and then use these models to serve business users. The reusability of such models is what gives Looker its power.

Going Forward

We hope this case study has given you a taste of data modeling in this new paradigm.

Use a data modeling layer tool. Use ELT. And what you’ll get from adopting the two is a flexible, easy approach to data modeling. We think this is the future. We hope you’ll agree.