SQL Tips

How to unnest / extract nested JSON data in BigQuery

Ha Pham

Jul 2, 2020 . 6 min read

Working with nested JSON data in BigQuery analytics database might be confusing for people new to BigQuery. Yet if done well, nested data structure (JSON) is a very powerful mechanism to better express hierarchical relationships between entities comparing to the conventional flat structure of tables.

In this article, we will show you how to handle nested data in BigQuery. Beside some obvious naming and syntax differences, you can apply the same approaches to handle nested data in many other SQL databases like PostgreSQL, MySQL...

The examples below use the ga_sessions sample dataset provided by Google. All queries are written with BigQuery's #StandardSQL mode.

How JSON data is stored in BigQuery

In BigQuery, JSON data may be stored in two ways:

In a column of type "RECORD": This data type is specifically designed to store nested structure data (JSON) in BigQuery.
In a column of type "STRING": The JSON value is treated just like a normal string that happens to have JSON format. Let's call this "stringified JSON".

Each scenario needs to be handled differently as follow.

Handling BigQuery's JSON data as RECORD type

What is RECORD type?

While PostgreSQL and MySQL have JSON type, BigQuery uses the RECORD (or STRUCT) type to represent nested structure. A column of RECORD type is in fact a large column containing multiple child columns.

Storing your data in RECORD columns takes a bit more effort as you need to define the data structure up-front when creating your table. For example, this is from the Create table dialogue in BigQuery:

Define the table schema, including schema of nested fields

On the other hand, the explicit structure brings you several benefits:

Consistency: Your future data is made sure to conform to the pre-defined structure → not have to worry that a valid query today will be invalid tomorrow.
Performance: You can selectively scan child columns instead of scanning the whole parent RECORD column, which will save you query money.
Ease of use: You can use dot notation (parent_col.child_col) to query from RECORD columns, which is quite convenient.

Your RECORD column can have a simple nested structure (each row only holds a single record), or it can hold nested and repeated data (each row holds multiple records). In this case, the RECORD field will have REPEATED mode:

Querying key-value JSON object (non-repeated RECORD field)

Querying single (non-repeated) RECORD field is like querying key-value JSON objects. You can simply use the dot notations to access the child columns:

#standardsql
select
  visitId
  , totals.hits
  , totals.pageviews
  , totals.timeOnScreen 
  , trafficSource.campaign 
  , trafficSource.isTrueDirect
  , trafficSource.adwordsClickInfo.adGroupId 
  , trafficSource.adwordsClickInfo.campaignId  
from test.ga_sessions_20170801

You can chain your dot notation to query deeper field values, for example trafficSource.adwordsClickInfo.adGroupId.

Result:

Querying JSON array (REPEATED RECORD fields)

As briefly mentioned above, REPEATED RECORD field is an array that hold multiple records in a single row. To access the records inside, you have to:

Use UNNEST to flatten the field into a table where each row is an element of the array
Join the flattened table back to the base table.

Sample code:

#standardsql
select 
  visitId 
  , h.hitNumber
  , h.time
  , h.hour
  , h.isInteraction
  , h.isEntrance
  , h.isExit
  , h.referer
  , h.page.pagePath
from test.ga_sessions_20170801
left join unnest(hits) as h -- No need to specify join key

Now we have one row for each hit (interaction) in the visit. Notice that visitId value is repeated across four rows:

What if the child column is also a nested column?

In the following example, you can see that hit number three interacted with multiple product SKUs. hits.product is also a REPEATED RECORD:

We can continue to unnest hits.product and join back to the base table:

#standardsql
select 
  visitId 
  , h.hitNumber
  , h.time
  , h.hour
  , h.isInteraction
  , p.productSKU
  , p.v2ProductName 
  , p.productPrice
from test.ga_sessions_20170801
left join unnest(hits) as h
left join unnest(h.product) as p -- No need to specify join key
where visitId = 1501589858

As you can see in the results, each row now represents a product interacted, and values of visitId, hitNumber, time, hour are repeated in each row.

Handling "stringified JSON" in BigQuery

The other common way JSON data is stored in BigQuery is as STRING data type. For example:

`hits` is now a string that has array format

Storing nested data as plain string requires no pre-defined schema, but it will bring you headaches in the future:

Inconsistency: No explicit schema means that you can have a column holding both JSON-formatted strings and normal free-form strings. This may make your query result unpredictable.
Bad performance and high cost: When extracting data from stringified JSON, you have to scan the whole text, which result in high query cost (more money) if your JSON string is large.
Lower ease of use: As demonstrated below, you will need to specify the exact JSON path to the data you want. While this is not much of a hurdle, it still takes some time to get used to.

Similar to RECORD type, stringified JSON can be a single JSON object, or a JSON array. You can check this for BigQuery's documentation on JSON functions.

Handle single-value JSON string in BigQuery

BigQuery has dedicated functions to handle this format:

json_extract(json_expression, json_path) will return all JSON values
json_extract_scalar(json_expression, json_path) will return only scalar values (string, number, boolean)

Sample code:

#standardsql
SELECT 
  visitId
  , json_extract_scalar(totals, '$.hits') as total_hits_scalar -- returns a number
  , json_extract(totals, '$.hits') as total_hits_json -- returns a number
  , json_extract_scalar(trafficSource, '$.adwordsClickInfo') as adwords_scalar -- returns nothing
  , json_extract(trafficSource, '$.adwordsClickInfo') as adwords_json -- returns a JSON object  
FROM test.test_json_string

As the value of hits in totals is a scalar, both functions return the same thing. The value of adwordsClickInfo in trafficSource is a JSON object so json_extract_scalar() returns nothing.

Handle stringified JSON array in BigQuery

With this format, you can use json_extract_array(json_expression[, json_path]) to extract array elements (json_path is optional).

In the example above, hits is a stringified JSON array:

#standardsql
SELECT 
  visitId
  , json_extract_array(hits) as hits
FROM test.test_json_string

As you can see json_extract_array(hits) returns a repeated column. To access elements in the new hits column, you need to unnest it just like with a REPEATED RECORD column:

#standardsql
SELECT 
  visitId
  , json_extract_scalar(h, '$.hitNumber') as hit_number
  , json_extract_scalar(p, '$.productSKU') as product_sku
  , json_extract_scalar(p, '$.productBrand') as product_brand
  , json_extract_scalar(p, '$.productPrice') as product_price
FROM test.test_json_string
left join unnest(json_extract_array(hits)) as h
left join unnest(json_extract_array(h, '$.product')) as p

The result is a nice flattened table:

Handling BigQuery JSON using Holistics (BI tool)

If you are using Holistics BI, there are a few things you can do to enable your end-users to explore BigQuery's nested JSON structures themselves.

Create Calculated Fields to access child columns at run time

If you have created a Table Model out on top of the source table, you can create calculated fields using the same dot notation mentioned above, for examples:

// total_hits
{{#THIS.totals}}.hits

// traffic_adw_google_click_id
{{#THIS.traffic_source}}.adwordsClickInfo.gclId

This way your end-user can drag and drop the child columns like normal during exploration:

Write SQL transformation to flatten arrays into a table

With Holistics's modeling layer, you can let your end-user have access to data in nested JSON arrays by:

Write a SQL model to unnest repeated columns in BigQuery into a flat table
Set a relationship between this derived SQL model with the base model
Add the derived SQL model in a dataset to expose it to your end user

Conclusion

With this, we conclude our guide to handling JSON/nested data structures in BigQuery. While we only wrote this in the context of BigQuery, you can apply the same approach to handle JSON fields and arrays in other SQL databases (like PostgreSQL, MySQL 8, Snowflake...)

If you think we have missed any frequently encountered issue when dealing with nested data, feel free to drop us a message.

Happy unnesting!