Despite the mass investment by third parties to provide API access to reports and data that their customers want, email still remains a fundamental part of the data transfer process. For Google Analytics, they provide a litany of backend API, Data Export, and other services to provide data to their customers for further analysis, however, sending data around by email is still a key feature in their application. Often times, you may not be able to access an API for security reasons and email is the last resort.
In this tutorial, we show you how to extract data from emails sent from Google Analytics to a Gmail account. We’ll be using only the Python Standard Library, imaplib, and email to achieve this.
The first thing we need to think about when accessing email accounts is security. In Google, if you have 2-Step Verification turned on (which you should, so if you haven’t go turn it on), you will need to create an individualized App Password for accessing your account from a given application – in this case, our Python interpreter.
Go to the Security settings of your account in Google and look for the “Signing in to Google” section where you should see “App passwords” as shown below. Click on this link and setup your password to read from your Gmail and your local machine, or if you plan to use a server, select “Other”.
Once you’ve setup your password, you’ll need to save it somewhere secure so you can use it to access your Gmail emails in the Python script. If you try to proceed with this tutorial without setting up an app password and you have 2-Step Authentication on, you’ll quickly run into this error:
The overall structure of the script we’ll use today is as follows:
What we won’t get into in this post is the detailed setups of cleaning up messy file formats in CSV (or any other format). We’re just interested in downloading these files from email for further cleaning and analysis down the line. That process of manipulating and cleansing files for analysis we’ll cover in a later post.
As you can see from the below data file we’re going to ingest in this post (a completely arbitrary file sent from Google Analytics), the file formatting is not immediately useful for analysis once downloaded, there are many steps to take after downloading to make this data useful for analysis.
Before you get started working through your Python code, you need to take a look at the basic structure of the emails that you’re receiving. Firstly, the imap and email servers have special means of searching for content. You do not want to download all the files on your email account, as if you’re like me, that will represent gigabytes of historic email contents.
The items to pay attention to for structuring your ability to pull down files from emails are:
Once you understand a bit more about the structure of your email, you can proceed to coding out how to pull down the email contents.
The first step in our Python script is to, as always, import the libraries we’re going to use in our script:
import pandas as pd import datetime, os, glob import email, imaplib
The next step is to enter in some hard-coded variables into your script for accessing your computers current working directory as well as the connection details to Gmail:
cwd = os.getcwd() EMAIL_UN = 'EMAIL' EMAIL_PW = 'GOOGLE APP PROVIDED PASSWORD'
The next step is to structure a function to generate the subject header details that will be used to search your email server for, in this case, a unique email. In our case, we want to match the exact subject header of our email and only pull down the file for a specific day. This last piece of logic is very important if you have multiple emails being delivered every day.
def details(subject_header,date=(datetime.datetime.now()-datetime.timedelta(1)).strftime("%d-%b-%Y")): #EMAIL SEARCH CRITERIA search_criteria = '(ON '+date+' SUBJECT "'+subject_header+'")' return search_criteria
The longest structure of our code* will be what’s used to actually access our email using imap and the email libraries. The most important functions of the script are the following:
def attachment_download(SUBJECT): un = EMAIL_UN pw = EMAIL_PW url = 'imap.gmail.com' detach_dir = '.' # directory where to save attachments (default: current) # connecting to the gmail imap server m = imaplib.IMAP4_SSL(url,993) m.login(un,pw) m.select() resp, items = m.search(None, SUBJECT) # you could filter using the IMAP rules here (check http://www.example-code.com/csharp/imap-search-critera.asp) items = items.split() # getting the mails id for emailid in items: resp, data = m.fetch(emailid, "(RFC822)") # fetching the mail, "`(RFC822)`" means "get the whole stuff", but you can ask for headers only, etc email_body = data # getting the mail content mail = email.message_from_string(str(email_body)) # parsing the mail content to get a mail object #Check if any attachments at all if mail.get_content_maintype() != 'multipart': continue print("["+mail["From"]+"] :" + mail["Subject"]) # we use walk to create a generator so we can iterate on the parts and forget about the recursive headach for part in mail.walk(): # multipart are just containers, so we skip them if part.get_content_maintype() == 'multipart': continue # is this part an attachment: if part.get('Content-Disposition') is None: continue filename = part.get_filename() counter = 1 # if there is no filename, we create one with a counter to avoid duplicates if not filename: filename = 'part-%03d%s' % (counter, 'bin') counter += 1 att_path = os.path.join(detach_dir, filename) #Check if its already there if not os.path.isfile(att_path): # finally write the stuff fp = open(att_path, 'wb') fp.write(part.get_payload(decode=True)) fp.close() print(str(filename)+ ' downloaded') return filename
Running your script through each of these functions should end up with a fully downloaded email file. Not only can this be used on CSV files, but files of truly any type.
For more information on the script we generated here today, you can access the code on our GitHub account, here.
*This code was originally pulled from a post that I can no longer find publicly available and I’ve been using it since 2015.
All datasets have one obvious thing in common, information, but this information is easy and…