Analytics
Google Analytics 4 (GA4) offers excellent functionality with its GA4 BigQuery dataset. However, you might find it challenging to align your GA4 User Interface (UI) figures with the unnested sources, mediums, and campaigns in BigQuery.
So, what happens in BigQuery compared to GA4?
When data is generated, it’s sent directly to BigQuery as raw data without any pre-processing. However, once it’s in GA4, Google applies various models and transformations. This includes joining data, modeling session traffic, adding cost data, and replacing ISO2 codes with language names.
Matching up your numbers
If you unnest the data in GA4 to extract the session source, you’ll notice significant discrepancies between your GA4 UI and GA4 BigQuery data, particularly if you have extensive paid and social activity.
The sample code below displays the raw traffic source, resulting in more users appearing as ‘direct’ or ‘organic’ than in the UI. This discrepancy arises because the data is unprocessed.
SELECT
DISTINCT
(select value.string_value from unnest(event_params) where key = 'source') AS source,
(select value.string_value from unnest(event_params) where key = 'medium') AS medium,
count(distinct(concat(user_pseudo_id,'.',
(select cast(value.int_value as string) from unnest(event_params) where key = 'ga_session_id')))) AS session_count
FROM `analytics_123456789.events_20230831`
WHERE event_name = 'page_view'
AND (select value.int_value from unnest(event_params) where key = 'entrances') =1
GROUP BY 1,2
ORDER BY 3 DESC
;
To mirror the GA4 UI, you’ll need to further customise this code. This is due to Google’s handling of direct and organic traffic for sessions in the UI. Specifically, if a user comes from a direct source, the system will use the user’s traffic source.
This observation is based on our extensive testing and data analysis. By making this adjustment, the figures align more closely with the GA4 UI, clarifying why CPC mediums might appear much lower in GA4 BigQuery compared to the GA4 UI.
SELECT
DISTINCT
( case when (select value.string_value from unnest(event_params) where key = 'medium') is null then traffic_source.source
when (select value.string_value from unnest(event_params) where key = 'medium') = 'organic' and regexp_contains(traffic_source.medium, '(cpc|paid|social|email|referral)') then traffic_source.source
else
(select value.string_value from unnest(event_params) where key = 'source') end) as source,
( case when (select value.string_value from unnest(event_params) where key = 'medium') is null then traffic_source.medium
when (select value.string_value from unnest(event_params) where key = 'medium') = 'organic' and regexp_contains(traffic_source.medium, '(cpc|paid|social|email|referral)') then traffic_source.medium
else
(select value.string_value from unnest(event_params) where key = 'medium') end) as medium,
count(distinct(concat(user_pseudo_id,'.',(select cast(value.int_value as string) from unnest(event_params) where key = 'ga_session_id')))) as session_count
FROM `analytics_123456789.events_20230831`
WHERE event_name = 'page_view'
AND (select value.int_value from unnest(event_params) where key = 'entrances') =1
#and ( (select value.string_value from unnest(event_params) where key = 'medium') is not null
#and (select value.string_value from unnest(event_params) where key = 'medium') <> 'organic')
GROUP BY 1,2
ORDER BY 3 DESC
;
An important point to mention is the joining of session_id with user_pseudo_id. This has been done because session_id isn’t unique on its own, so pairing it with user_pseudo_id addresses this issue. Even after this adjustment, you’ll observe some discrepancies, though they’re more aligned with the GA4 UI. These discrepancies arise for various reasons, including:
- Googles data driven models
- The use of approx. count in the UI
- The reporting ID you have set
- Timezone differences: by default the event_timestamp is in UTC on your BigQuery dataset.
It’s essential to consider several factors. While modelling your data to align with the GA4 UI is advantageous for reporting, there’s value in understanding the genuine origins of your sessions without any modelling.
For instance, if a user session is from a direct source, it might indicate that users have bookmarked or favourited your website and are potentially more engaged. Always keep in mind the specific scope of your analysis and reporting, and choose the approach that delivers the data you need.
If you have any questions on GA4 UI replication – or your GA4 configuration – get in touch with our team.