Transform JSON Into a DataFrame

JSON is one of the most common data formats in use today. As a result, it is critical to understand how to transform JSON into a Pandas DataFrame for manipulation. Part of the challenge with JSON is the variety of shapes its purposely loose structure can take, and what that means for constructing a DataFrame from its contents.

In this post, we’ll cover several fundamental functions within Pandas that transform JSON data from its raw form into DataFrames. We covered one of these approaches in a prior post; here we go beyond the json_normalize function to cover from_dict and read_json as well.

Sample JSON Datasets

We covered the CitiBike system update dataset in a prior post and will continue to use its extensive JSON API in this tutorial. We’ll be looking specifically at the Station Information feed (station_information.json), which lists the stations in the system. It’s a straightforward example of a JSON data structure and helps us understand how to import the data into a DataFrame.

Using json_normalize

The most important JSON import function in Pandas is json_normalize, which flattens nested JSON data into a columnar format for further analysis.

Before we can begin using Python to transform JSON data, we need to import the libraries that make the analysis possible: pandas, requests, and json.

# Import requests, pandas, json_normalize, and json
import json

import requests
import pandas as pd
# Top-level import available since pandas 1.0; the older pandas.io.json path is deprecated
from pandas import json_normalize

Next, we access the API with a simple GET call in Requests to pull the feed into our Python environment. The resulting data, which can also be seen by navigating to the URL itself, is available as parsed Python objects through r.json().

url = "https://gbfs.citibikenyc.com/gbfs/en/station_information.json"
r = requests.get(url)
r.json()

The next step is to explore the contents of the JSON object by using the .keys() method to see the nested layers of data stored in it. In our case, we’re most interested in the ‘data’ key of the JSON feed.

r.json().keys()
r.json()['data'].keys()
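
To make this exploration step concrete without calling the live API, here is a minimal sketch that walks a small hand-written payload. The sample dictionary and the describe_keys helper are hypothetical, invented for illustration; the real feed has more fields and may be laid out differently.

```python
# Hypothetical payload that mimics the general shape of the feed (not real data).
sample = {
    "last_updated": 1600000000,
    "ttl": 5,
    "data": {
        "stations": [
            {"station_id": "1", "name": "Example St", "capacity": 20},
            {"station_id": "2", "name": "Sample Ave", "capacity": 15},
        ]
    },
}


def describe_keys(obj, prefix=""):
    """Return dotted key paths for nested dicts; list levels are marked with []."""
    paths = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.append(path)
            paths.extend(describe_keys(value, path))
    elif isinstance(obj, list) and obj:
        # Inspect only the first element, assuming all records share a layout.
        paths.extend(describe_keys(obj[0], prefix + "[]"))
    return paths


key_paths = describe_keys(sample)
print(key_paths)
```

A listing like this shows at a glance which key holds the list of records that we want to turn into rows.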

The data we’re looking for is stored under the data and stations keys of the JSON object. Its contents should be a list of dictionaries, one per station.

Once we confirm we have the right level of the JSON object, we can then use the json_normalize function to transform that data into a DataFrame as seen below:

# General form: json_normalize(JSON)
r.json()['data']['stations']
stations = r.json()['data']['stations']
df = json_normalize(stations)
df.head()
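
As a rough illustration of what json_normalize does beyond the plain DataFrame constructor, the sketch below uses two made-up station records with one nested field. The records and the rental_uris field values are assumptions for the example, not data taken from the feed.

```python
import pandas as pd

# Hypothetical station records with one nested field (not real feed data).
stations = [
    {"station_id": "1", "name": "Example St", "rental_uris": {"ios": "a", "android": "b"}},
    {"station_id": "2", "name": "Sample Ave", "rental_uris": {"ios": "c", "android": "d"}},
]

# The plain constructor keeps each nested dict as a single object column.
raw = pd.DataFrame(stations)

# json_normalize expands nested dicts into dot-separated column names.
df = pd.json_normalize(stations)

print(raw.columns.tolist())
print(df.columns.tolist())
```

The nested field becomes one column per inner key, which is usually easier to filter and aggregate than a column of dictionaries.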

Using from_dict

One additional way of converting a JSON object to a DataFrame is the from_dict function. There is one caveat: we must confirm that the object we have stored is of type ‘dict’ once it is read into a variable in Python. r.json() returns a dict when the top level of the JSON is an object, but a list when the top level is an array. You can check this by calling type(data) on your data object. If it is a dictionary, we can then read the data into a DataFrame as seen below. Note that the stations value passed here is itself a list of dictionaries, which the DataFrame constructor also accepts.

type(r.json())
df = pd.DataFrame.from_dict(r.json()['data']['stations'])
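
The orient argument of from_dict controls whether the dictionary keys become columns or row labels. Here is a small sketch with made-up values; by_column and by_row are hypothetical names chosen for this example.

```python
import pandas as pd

# Hypothetical data: a dict mapping column names to lists of values.
by_column = {"station_id": ["1", "2"], "capacity": [20, 15]}

# Hypothetical data: a dict mapping row labels to per-row dicts.
by_row = {"1": {"capacity": 20}, "2": {"capacity": 15}}

# Confirm the type before calling from_dict.
print(type(by_column), type(by_row))

# Default orient="columns": the dict keys become column names.
df_columns = pd.DataFrame.from_dict(by_column)

# orient="index": the dict keys become the row index.
df_rows = pd.DataFrame.from_dict(by_row, orient="index")

print(df_columns)
print(df_rows)
```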

Using read_json

The third approach to reading JSON objects into a DataFrame is the read_json function in Pandas. JSON can be passed straight into this function or, as in our case, we can pass the URL of a JSON feed.

Once the data is loaded, however, we can see that it is not at the stations level that we want to extract (as seen earlier in the tutorial). To reach that level, we must access the data series within our newly created DataFrame (assigned to the variable d) and loop over its records to build a new DataFrame, d2.

d = pd.read_json("https://gbfs.citibikenyc.com/gbfs/en/station_information.json")
# Use .iloc for positional access; the index here is made of labels, not integers
d["data"].iloc[0]
d2 = pd.DataFrame([i for i in d["data"].iloc[0]])
d2.head()
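
The same extraction can be tried offline. The sketch below feeds read_json a small hand-written JSON string whose shape only approximates the feed; the field values are invented, and the exact layout of the live feed is an assumption.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON text that mimics the general shape of the feed (not real data).
raw = """
{
  "last_updated": 1600000000,
  "ttl": 5,
  "data": {
    "stations": [
      {"station_id": "1", "capacity": 20},
      {"station_id": "2", "capacity": 15}
    ]
  }
}
"""

# read_json builds one row per key found under "data" (here, only "stations").
d = pd.read_json(StringIO(raw))

# The single cell in the "data" column holds the list of station records.
records = d["data"].iloc[0]

# Build the station-level DataFrame from that list.
d2 = pd.DataFrame(records)
print(d2)
```

Wrapping the text in StringIO keeps the example independent of any network call while exercising the same read_json code path.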

Summary

json_normalize and read_json are the two key functions within Pandas for reading JSON data into a DataFrame for further analysis. As you can see from our tutorial, in some instances data does not come into the DataFrame smoothly and requires a bit more unnesting to reach a fully columnar level. Solving for that is not in the scope of this article, but some details can be found in the json_normalize documentation, in particular under the record_path and meta arguments.
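
As a pointer for that deeper unnesting, here is a minimal sketch of json_normalize with its record_path and meta arguments. The payload, with stations grouped under regions, is entirely hypothetical and is not how the CitiBike feed is organized.

```python
import pandas as pd

# Hypothetical nested payload: each region holds its own list of stations.
payload = [
    {
        "region": "North",
        "stations": [
            {"station_id": "1", "capacity": 20},
            {"station_id": "2", "capacity": 15},
        ],
    },
    {
        "region": "South",
        "stations": [{"station_id": "3", "capacity": 30}],
    },
]

# record_path names the nested list to expand into rows;
# meta names parent-level fields to repeat on each of those rows.
flat = pd.json_normalize(payload, record_path="stations", meta=["region"])
print(flat)
```

Each station becomes one row, and the region of its parent record is carried along as an extra column.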

For the code and screenshots from this tutorial, please visit our GitHub repository on Data Analysis and the Jupyter Notebook for this post.

Andrew W. Owens

Analytics and sciences contributor and professional. Specializing in Python and GCP.
