
Extracting Data From Gmail Emails With Python

Despite heavy investment by third parties in API access to the reports and data their customers want, email remains a fundamental part of the data transfer process. Google Analytics, for example, offers backend APIs, data export, and other services for getting data out for further analysis, yet sending reports by email is still a key feature of the application. You may also be unable to access an API for security reasons, which leaves email as the last resort.

In this tutorial, we show you how to extract data from emails sent from Google Analytics to a Gmail account. The download itself relies only on two modules from the Python Standard Library: imaplib and email.

Getting Your Gmail Account Set Up & Secure

The first thing we need to think about when accessing email accounts is security. If you have 2-Step Verification turned on in your Google account (which you should; if you haven’t, go turn it on), you will need to create an individual App Password for each application that accesses your account. In this case, that application is our Python interpreter.

Go to the Security settings of your Google account and look for the “Signing in to Google” section, where you should see “App passwords” as shown below. Click this link and set up a password for reading your Gmail from your local machine, or select “Other” if you plan to run the script on a server.

App Passwords with Gmail

Once you’ve set up your password, save it somewhere secure so you can use it to access your Gmail emails from the Python script. If you try to proceed with this tutorial without an app password while 2-Step Verification is on, the login will quickly fail with an authentication error.

Structuring The Python Code

The overall structure of the script we’ll use today is as follows:

  • Import required libraries
  • Input hard-coded variables
  • Generate functions for:
    • Creating the search logic of our Inbox
    • Accessing the contents of a specific email to access a file for download
    • Saving the given file

What we won’t get into in this post is the detailed work of cleaning up messy CSV files (or any other format). We’re only interested in downloading these files from email for cleaning and analysis down the line. Manipulating and cleansing the files is something we’ll cover in a later post.

As you can see from the data file below that we’re going to ingest in this post (a completely arbitrary file sent from Google Analytics), the file is not immediately useful for analysis once downloaded. Several more steps are needed after downloading to make the data usable.

Example/Arbitrary Google Analytics Email

Inspecting the Email Structure

Before you start working through your Python code, take a look at the basic structure of the emails you’re receiving. IMAP servers have their own syntax for searching mail, and you will want to use it: you do not want to download every file in your email account because, if you’re like me, that represents gigabytes of historic email.

The items to pay attention to when deciding how to pull files down from emails are:

  • The Subject of the email (see below)
  • The Sender
  • The Attachment File Name
  • The Datetime/Received time of the email

Once you understand a bit more about the structure of your email, you can proceed to coding out how to pull down the email contents.
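
One way to check those four fields without logging in to a mailbox is to parse a message locally with the standard library’s email package. The sketch below builds a made-up message (the subject, address, date, and attachment name are placeholders, not taken from a real Google Analytics email) and reads the same fields back. The describe helper is hypothetical and not part of the script in this post.

```python
from email import message_from_bytes
from email.message import EmailMessage
from email.utils import parsedate_to_datetime

# Build a stand-in message; every value here is a hypothetical placeholder.
sample = EmailMessage()
sample["Subject"] = "Example Report"
sample["From"] = "sender@example.com"
sample["Date"] = "Mon, 01 Jan 2024 08:00:00 +0000"
sample.set_content("Report attached.")
sample.add_attachment(b"a,b\n1,2\n", maintype="application",
                      subtype="octet-stream", filename="example_report.csv")


def describe(raw_bytes):
    """Return the subject, sender, date, and attachment names of one message."""
    mail = message_from_bytes(raw_bytes)
    # Only parts marked "attachment" in Content-Disposition carry a file.
    names = [part.get_filename() for part in mail.walk()
             if part.get_content_disposition() == "attachment"]
    return {
        "subject": mail["Subject"],
        "sender": mail["From"],
        "date": parsedate_to_datetime(mail["Date"]),
        "attachments": names,
    }


info = describe(sample.as_bytes())
print(info)
```

Knowing which of these values stay constant from one email to the next tells you which ones are safe to search on.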

Pulling Down Email Contents

As always, the first step in our Python script is to import the libraries we’re going to use:

import datetime, os
import email, imaplib

The next step is to add some hard-coded variables to your script: your computer’s current working directory and the connection details for Gmail:

cwd = os.getcwd()

EMAIL_UN = 'EMAIL'
EMAIL_PW = 'GOOGLE APP PROVIDED PASSWORD'
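
Hard-coding an app password makes it easy to leak through version control. A minimal alternative, assuming you export the credentials as environment variables first, is to read them at start-up. The names GMAIL_USER and GMAIL_APP_PASSWORD and the load_credentials helper are made up for this sketch and are not part of the original script.

```python
import os


def load_credentials(env=os.environ):
    """Read the Gmail login from a mapping of environment variables.

    GMAIL_USER and GMAIL_APP_PASSWORD are hypothetical names chosen for
    this sketch; use whatever names suit your setup.
    """
    try:
        return env["GMAIL_USER"], env["GMAIL_APP_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError("Set %s before running the script" % missing) from None


# Demonstration with a stand-in mapping instead of the real environment.
EMAIL_UN, EMAIL_PW = load_credentials(
    {"GMAIL_USER": "me@example.com", "GMAIL_APP_PASSWORD": "app-password-here"}
)
```

In a real run you would call load_credentials() with no arguments so the values come from the environment.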

The next step is to write a function that generates the search criteria used to query your email server for, in this case, a unique email. We want to match the subject header of our email and only pull down the file for a specific day, which defaults to yesterday. IMAP’s SUBJECT key matches any subject that contains the given text, so pass the full subject line to keep the match narrow. The date filter is very important if you have multiple emails being delivered every day.

# Generate the email search criteria
def details(subject_header, date=None):
    # Default to yesterday, computed at call time rather than once at import time
    if date is None:
        date = (datetime.datetime.now() - datetime.timedelta(1)).strftime("%d-%b-%Y")
    # EMAIL SEARCH CRITERIA
    search_criteria = '(ON ' + date + ' SUBJECT "' + subject_header + '")'
    return search_criteria
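
The same idea extends to the other fields identified earlier. As a sketch, the hypothetical helper below (build_criteria is not part of the original script) adds an optional FROM filter. It also formats the date from a fixed list of English month abbreviations, on the assumption that your machine may use a non-English locale: strftime("%b") follows the locale, while IMAP expects English month names.

```python
import datetime

# IMAP dates use English month abbreviations regardless of the local locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def imap_date(day):
    """Format a date as DD-Mon-YYYY, the form IMAP SEARCH expects."""
    return "%02d-%s-%04d" % (day.day, MONTHS[day.month - 1], day.year)


def build_criteria(subject_header, sender=None, day=None):
    """Build an IMAP SEARCH string; the date defaults to yesterday."""
    if day is None:
        day = datetime.date.today() - datetime.timedelta(days=1)
    parts = ["ON", imap_date(day), 'SUBJECT "%s"' % subject_header]
    if sender:
        parts.append('FROM "%s"' % sender)
    return "(" + " ".join(parts) + ")"


print(build_criteria("Example Report", "sender@example.com",
                     datetime.date(2024, 3, 5)))
# prints: (ON 05-Mar-2024 SUBJECT "Example Report" FROM "sender@example.com")
```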

The longest part of our code* is the function that actually accesses our email using the imaplib and email libraries. The most important steps of the function are the following:

  • Logging in to Gmail with your credentials at the imap.gmail.com host
  • Searching your email server with the criteria generated above
  • Iterating through the emails returned by the search
  • Writing the attached file to the current working directory detailed earlier in our script

def attachment_download(SUBJECT):
    un = EMAIL_UN
    pw = EMAIL_PW
    url = 'imap.gmail.com'

    detach_dir = cwd # directory where to save attachments (the current working directory)
    filename = None # name of the last attachment found, returned to the caller
    counter = 1 # used to name attachments that arrive without a filename

    # connecting to the gmail imap server
    m = imaplib.IMAP4_SSL(url, 993)
    m.login(un, pw)
    m.select() # with no arguments this selects the INBOX
    resp, items = m.search(None, SUBJECT)
    # you could filter using the IMAP rules here (check http://www.example-code.com/csharp/imap-search-critera.asp)

    items = items[0].split() # getting the mail ids

    for emailid in items:
        resp, data = m.fetch(emailid, "(RFC822)") # fetching the mail, "(RFC822)" means "get the whole message", but you can ask for headers only, etc.
        email_body = data[0][1] # getting the raw mail content (bytes in Python 3)
        mail = email.message_from_bytes(email_body) # parsing the raw bytes to get a mail object

        # Check if any attachments at all
        if mail.get_content_maintype() != 'multipart':
            continue

        print("[" + str(mail["From"]) + "] :" + str(mail["Subject"]))

        # we use walk to create a generator so we can iterate on the parts and forget about the recursive headache
        for part in mail.walk():
            # multipart are just containers, so we skip them
            if part.get_content_maintype() == 'multipart':
                continue

            # is this part an attachment?
            if part.get('Content-Disposition') is None:
                continue

            filename = part.get_filename()

            # if there is no filename, we create one with a counter to avoid duplicates
            if not filename:
                filename = 'part-%03d%s' % (counter, '.bin')
                counter += 1

            # basename() drops any directory components from the sender-supplied name
            att_path = os.path.join(detach_dir, os.path.basename(filename))

            # Check if it's already there
            if not os.path.isfile(att_path):
                # finally write the attachment to disk
                with open(att_path, 'wb') as fp:
                    fp.write(part.get_payload(decode=True))
            print(str(filename) + ' downloaded')

    m.logout()
    return filename
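
Because the parsing half of this function does not need a network connection, it can be tried out offline. The following sketch is a pared-down variant, not the function above: save_attachments is a hypothetical helper that takes the raw bytes of one message and writes its attachments to a directory, and the sample message is fabricated purely for the demonstration.

```python
import email
import os
import tempfile
from email.message import EmailMessage


def save_attachments(raw_bytes, out_dir):
    """Write every attachment in one raw message to out_dir; return the names."""
    mail = email.message_from_bytes(raw_bytes)
    saved = []
    for index, part in enumerate(mail.walk(), start=1):
        # Containers and inline body parts carry no file to save.
        if part.get_content_maintype() == "multipart":
            continue
        if part.get_content_disposition() != "attachment":
            continue
        # basename() drops any directory components a sender might include.
        name = os.path.basename(part.get_filename() or "part-%03d.bin" % index)
        with open(os.path.join(out_dir, name), "wb") as fp:
            fp.write(part.get_payload(decode=True))
        saved.append(name)
    return saved


# Fabricated message standing in for a real report email.
demo = EmailMessage()
demo["Subject"] = "Example Report"
demo.set_content("Report attached.")
demo.add_attachment(b"a,b\n1,2\n", maintype="application",
                    subtype="octet-stream", filename="example_report.csv")

with tempfile.TemporaryDirectory() as tmp:
    saved = save_attachments(demo.as_bytes(), tmp)
    with open(os.path.join(tmp, saved[0]), "rb") as fp:
        contents = fp.read()
```

With real credentials in place, the full flow from this post would be called as attachment_download(details('Your Email Subject')), where the subject is a placeholder for your own.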

Summary

Running your script through each of these functions should leave you with the attachment from the matching email downloaded to your working directory. This works not only for CSV files but for attachments of any type.

For more information on the script we generated here today, you can access the code on our GitHub account, here.

*This code was originally pulled from a post that I can no longer find publicly available. I’ve been using it since 2015.

Andrew W. Owens

Analytics and sciences contributor and professional. Specializing in Python and GCP.
