Extract Google Analytics Data from BigQuery with Python

Why Query BigQuery?

BigQuery is a seriously powerful data warehousing technology by Google that has direct integration into Google Analytics through GA 360 accounts. The amazing part of the technology is that it can quickly be set up to pull down your GA data at an extremely low level of granularity.

Some of the biggest reasons why you may want to dive into BigQuery directly are below:

  • The data provides unsampled raw data for KPI generation
  • Queries can be integrated into reporting suites such as Looker, Tableau, and Data Studio
  • More granular data can be queried than through the UI
  • Multiple GA accounts can be Unioned together across the same Google BigQuery project

These are just a few of the massive benefits of using the GA 360 BigQuery backend, now we’ll dive into the nitty-gritty of setting up Python so you can execute queries against your GA-BigQuery project.

Import Libraries & Static Variables

In prior posts, we’ve covered extensively how to set up your python environment permissions for BigQuery projects, so we won’t cover those initial details in this post. However, we will review installing the libraries you’ll need to both access BigQuery and to put the data into some kind of meaningful format (a Pandas DataFrame).

So first things first, we’ll need Pandas, so if you don’t have that installed we recommend following the details in the link above for a full overview.

We’ll next need to import specific functions within Pandas itself to get ready to query our data. We recommend the following as the first actions in your script:

from pandas.io import gbq
import pandas as pd
import datetime

Now that we have our libraries loaded, we have one additional item required for querying our BigQuery project – the project id and private key needed to access our project. These will look something like the below but you’ll need to follow our prior post (linked above) to get your own project id and private key.

project_id = 'account_name:api-project-XXXXXXXXXXX'
private_key = 'PROJECT-########.json'

Once we have these details, we’re ready to query the data.

Structure Your BigQuery Query

The import and variable steps are a breeze, where things get tricky are when it comes time to test your queries against the actual data in your BigQuery warehouse.

In the query below we’re essentially searching for several items:

  • Data for the past 365 full days (not including today)
  • The level 1 page path to any given page on our site
  • The Channel Grouping provided by Google Analytics
  • The count of sessions by the two groupings above
  • Pageviews by the two groupings above
  • Bounces by the two groupings above
  • Pages that contain the string “/posts/” (a common WordPress page path slug)

The below query is structured to achieve the extraction of each of these data points:

SELECT 
hits.page.pagePathLevel1,
channelGrouping,  
count(distinct(concat(fullvisitorid, string(visitid)))) as Sessions,  sum(case when hits.type = 'PAGE' then 1 else 0 end) as Pageviews,  sum(totals.bounces) as Bounces
FROM TABLE_DATE_RANGE([112922704.ga_sessions_],
DATE_ADD(CURRENT_TIMESTAMP(), -366, 'DAY'),
DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')) 
WHERE  hits.page.pagePath contains '/posts/' 
GROUP BY 1, 2

Test Your Query

When running queries there is generally some meta-data we want to store as a result of the query run:

  • When was the query run?
  • What was the title of the query?
  • We want it to write to a file for future use

The below function helps us extract this data in a format usable for each of these use cases.

def bq_execute(q, q_name, private_key=,private_key,project_id=project_id):
    file_name = q_name+"-"+str(datetime.datetime.now())[0:16]+".csv"
    df = gbq.read_gbq(q, project_id,private_key)
    df.to_csv(file_name,index=False,encoding='utf-8')
    return df

The function also returns a DataFrame object containing the contents of the query for analysis within Python itself that can be written to a variable df as follows:

df = bq_execute(query, "yearly_session_data")

The Full Script

When running this script, you may encounter some configuration issues. If this is the case, it is likely due to how your GA project is setup and how you have generated your private key. Please see our prior post for more detail on how to do this correctly, which should answer the majority of your questions.

from pandas.io import gbq
import pandas as pd
import datetime

#You must fill in these details with your own project information
project_id = 'account_name:api-project-XXXXXXXXXXX'
private_key = 'PROJECT-########.json'

#GA Query:
query = 
"""SELECT 
CASE hits.page.pagePathLevel1,  channelGrouping,  count(distinct(concat(fullvisitorid, string(visitid)))) as Sessions,  sum(case when hits.type = 'PAGE' then 1 else 0 end) as Pageviews,  sum(totals.bounces) as Bounces
FROM TABLE_DATE_RANGE([112922704.ga_sessions_],
DATE_ADD(CURRENT_TIMESTAMP(), -366, 'DAY'),
DATE_ADD(CURRENT_TIMESTAMP(), -1, 'DAY')) 
WHERE  hits.page.pagePath contains '/posts/' 
GROUP BY 1, 2"""

#Function for query BigQuery using Pandas:
def bq_execute(q, q_name, private_key=,private_key,project_id=project_id):
    file_name = q_name+"-"+str(datetime.datetime.now())[0:16]+".csv"
    df = gbq.read_gbq(q, project_id,private_key)
    df.to_csv(file_name,index=False,encoding='utf-8')
    return df

#Execution of the query script writing the output to a DataFrame
df = bq_execute(query, "yearly_session_data")

Summary

The script outlined in this post should help you get a quick framework for how to query your Google Analytics BigQuery project data. The script is quickly customizable and can be altered to use any valid query for the raw Google Analytics data.

Andrew W. Owens

Analytics and sciences contributor and professional. Specializing in Python and GCP.

Recent Posts

Adding rows to a Pandas Dataframe

While studying Data Science, we often come across DataFrames ready to be used. Normally, those…

6 days ago

How to Install & Import Pandas in Python

Pandas is one of the most powerful libraries for data analysis and is the most…

2 weeks ago

Decision Trees in Scikit-Learn

Introduction The decision tree is a machine learning algorithm which perform both classification and regression.…

3 weeks ago

A Holistic Guide to Groupby Statements in Pandas

The Importance of Groupby Functions In Data Analysis Whether working in SQL, R, Python, or…

4 weeks ago

Logistic Regression in Sci-Kit Learn

Introduction Logistic regression is an important model used in supervised learning. You can use logistic…

1 month ago

Pandas-Profiling, explore your data faster in Python

All datasets have one obvious thing in common, information, but this information is easy and…

1 month ago