
Create a DataFrame or Series from a List or Dictionary

Use Pandas Series or DataFrames to make your data life easier

In this article, we will take you through one of the most commonly used methods to create a DataFrame or Series – from a list or a dictionary, with clear, simple examples.

Introduction

Pandas is the go-to tool for manipulating and analysing data in Python. If you have been dabbling with data analysis, data science, or anything data-related in Python, you are probably not a stranger to Pandas. Pandas is a very feature-rich, powerful tool, and mastering it will make your life easier, richer and happier, for sure. (Well, as far as data is concerned, anyway.) 

Just as a journey of a thousand miles begins with a single step, any analysis in Pandas begins with getting the data into Pandas in the first place.

Series and DataFrames are the core data types used in Pandas for data analysis. Pandas is built on top of NumPy, and if you are not familiar with Series and DataFrames, it is probably easiest to think of a Series as the Pandas equivalent of a one-dimensional array, and a DataFrame as a two-dimensional array composed of multiple Series. Another analogy is a spreadsheet, where a Series is essentially a single column of data, whereas a DataFrame is like an entire sheet.

Let’s begin to explore a few of the many ways that exist to create them. For our dummy data, I will use continental data from the GapMinder dataset.

Import pandas

We can’t do anything without importing the pandas module. The convention is to import pandas as pd (and numpy as np) to save our precious keystrokes.

import pandas as pd
import numpy as np

Creating a Series

From a list

Creating a Series is easy. Simply passing a list to the pd.Series function will convert that list to a Series. Try creating a new Series object with:

continents = pd.Series(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'])

Inspecting the new object with type(continents) reveals to us that it is a pandas.core.series.Series object. Also, typing continents into the Python shell will show the contents.
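As a quick check of the steps above, here is a minimal sketch that creates the Series and inspects it. It uses isinstance rather than the printed type name, on the assumption that the exact module path shown by type() may vary between pandas versions.

```python
import pandas as pd

continents = pd.Series(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'])

# isinstance confirms the object type without relying on its printed name
print(isinstance(continents, pd.Series))

# With no explicit index, pandas labels the rows 0, 1, 2, ...
print(list(continents.index))

# Printing the Series shows the index on the left and the values on the right
print(continents)
```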

So, why use a Series over a list? Well, there is far more you can do with a Series than you can with a list. For example, you can create a Series with an explicit index, whose labels work much like the keys of a Python dictionary. Try:

continents_l = pd.Series(
    ['Asia', 'Europe', 'Africa', 'Americas', 'Oceania']
    , index=['A', 'B', 'C', 'D', 'E']
)

This will create a series, where each row can be addressed with the letter index, like:

continents_l['D']

And behind the scenes, each row also keeps its integer position (counting from zero) – try addressing the same row with:

continents_l.iloc[3]

Both approaches should result in 'Americas'! It simply gives you more options.
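The two lookups can be sketched side by side as below. The .loc accessor is the explicit, label-based counterpart of .iloc; the variable names by_label, by_position and several are only for illustration.

```python
import pandas as pd

continents_l = pd.Series(
    ['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'],
    index=['A', 'B', 'C', 'D', 'E'],
)

# Label-based lookup: .loc makes it explicit that 'D' is an index label
by_label = continents_l.loc['D']

# Position-based lookup: .iloc counts rows from zero, so 3 is the fourth row
by_position = continents_l.iloc[3]

# Passing a list of labels returns a smaller Series rather than a single value
several = continents_l.loc[['A', 'E']]

print(by_label, by_position)
print(several)
```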

Seeing that Series are, in some ways, fancy dictionary objects, it should be no surprise that dictionaries can be used to create Series objects.

From a dictionary

The basic structure of creating a Series object from a dictionary is simple – pass a dictionary to the pd.Series function, with the dictionary in the format {'index': 'value'}.

So, to create the same Series as the letter-indexed one above, we can write:

continents_d = pd.Series({'A': 'Asia', 'B': 'Europe', 'C': 'Africa', 'D': 'Americas', 'E': 'Oceania'})

Now, we can check that our two Series objects are the same, by:

continents_d == continents_l

This produces a Series with True for every row. Interestingly, an attempt to compare the dictionary-based Series (continents_d) with the first Series (continents) that we created results in an exception – try it out.

(It should produce a ValueError: Can only compare identically-labeled Series objects)
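Here is a small sketch of both comparisons, so that you can see the behaviour without the exception stopping your session. The try/except wrapper and the tolist() comparison at the end are my additions, shown as one possible way to compare values while ignoring the index.

```python
import pandas as pd

values = ['Asia', 'Europe', 'Africa', 'Americas', 'Oceania']
continents = pd.Series(values)                                      # index 0..4
continents_l = pd.Series(values, index=['A', 'B', 'C', 'D', 'E'])   # index A..E
continents_d = pd.Series(
    {'A': 'Asia', 'B': 'Europe', 'C': 'Africa', 'D': 'Americas', 'E': 'Oceania'}
)

# Same labels on both sides: element-wise comparison works
same_labels = continents_d == continents_l
print(same_labels.all())

# equals() returns a single True/False for the whole object
print(continents_d.equals(continents_l))

# Different labels: pandas refuses to guess how the rows should line up
raised = False
try:
    continents_d == continents
except ValueError as err:
    raised = True
    print(err)

# If only the values matter, compare them as plain lists instead
values_match = continents_d.tolist() == continents.tolist()
print(values_match)
```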

So, why create a DataFrame, or a Series for that matter? Well, it will become clearer later on, but the short answer is that manipulating data becomes much, much easier inside a Series than inside a dictionary.

Before we move on, try this simple example. Let’s say that we wanted to add the text 'Continent: ' as a prefix to all of the values. Well, through the magic of broadcasting, all we would need to do is this:

'Continent: ' + continents_l

And it will add the string 'Continent: ' to all of the values in our Series. Granted, this would not be that difficult to do to a dictionary either, but you can begin to see how it would become easier to operate on columns of data, rather than have to always loop, or use comprehensions.
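To make the comparison concrete, the sketch below does the same thing to a Series and to a plain dictionary. The dictionary version (continents_dict, prefixed_dict) is a hypothetical counterpart that I have added purely for contrast.

```python
import pandas as pd

continents_dict = {'A': 'Asia', 'B': 'Europe', 'C': 'Africa',
                   'D': 'Americas', 'E': 'Oceania'}
continents_l = pd.Series(continents_dict)

# Series: one expression, applied to every value at once
prefixed = 'Continent: ' + continents_l

# Dictionary: the same result needs an explicit comprehension
prefixed_dict = {key: 'Continent: ' + value
                 for key, value in continents_dict.items()}

print(prefixed)
```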

Similar numerical operations are possible for integer, or floating point data types inside our Series. But we’ll come back to those briefly. For now, let’s move onto creating DataFrames.
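As a brief taste of the numerical case, here is a sketch using the 2007 continental population figures that appear later in this article. The names pop_millions and pop_share are illustrative only.

```python
import pandas as pd

continent_pop = pd.Series([3811953827, 586098529, 929539692, 898871184, 24549947])

# Dividing by a scalar is applied to every element
pop_millions = continent_pop / 1_000_000

# Aggregates such as sum() combine naturally with element-wise arithmetic
pop_share = continent_pop / continent_pop.sum()

print(pop_millions.round(1))
print(pop_share.round(3))
```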

Create a DataFrame

Because a DataFrame is multi-dimensional, we can create it either based on multiple columns, or multiple rows. There is no right or wrong way, and one method may be preferable depending on the data that you start with. It can, however, get a little confusing – but, stay with me here. I promise, it will be fine once you get the hang of it.

From Series

Much as a numpy array does not need to have multiple columns, a DataFrame can have just a single column. A single-column DataFrame could be created from a Series by:

continents_df = pd.DataFrame(continents_l, columns=['Continents'])

Once again, feel free to inspect it by entering type(continents_df) or continents_df into the shell. You will notice that it looks largely the same, although the object type is now a DataFrame (pandas.core.frame.DataFrame).
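An alternative that you may come across is the to_frame method on the Series itself. The sketch below shows it as another option, not as a replacement for the approach above; it keeps the letter index and names the single column.

```python
import pandas as pd

continents_l = pd.Series(
    ['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'],
    index=['A', 'B', 'C', 'D', 'E'],
)

# to_frame converts the Series into a one-column DataFrame;
# name sets the label of that column
continents_df = continents_l.to_frame(name='Continents')

print(type(continents_df))
print(continents_df.shape)   # (rows, columns)
print(continents_df)
```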

To create a DataFrame from multiple Series, the easiest approach is to pass them as dictionary key:value pairs, where each key is the desired column name.

Let’s say that we have a Series for the population figures (from 2007), created as:

continent_pop = pd.Series([3811953827, 586098529, 929539692, 898871184, 24549947])

Then, a DataFrame can be created by simply passing a dictionary as follows:

continents_df = pd.DataFrame({'continents': continents, 'population': continent_pop})
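One detail worth seeing in action: when a DataFrame is built from several Series, the rows are matched up by index label rather than by position. The sketch below illustrates this with a reversed copy of the population Series (pop_reversed and reordered_df are hypothetical names for this example).

```python
import pandas as pd

continents = pd.Series(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'])
continent_pop = pd.Series([3811953827, 586098529, 929539692, 898871184, 24549947])

continents_df = pd.DataFrame({'continents': continents, 'population': continent_pop})
print(continents_df)

# Reverse the row order; each value keeps its original index label
pop_reversed = continent_pop.iloc[::-1]

# The labels still line up, so each continent gets its own population
reordered_df = pd.DataFrame({'continents': continents, 'population': pop_reversed})
print(reordered_df.loc[0])   # the row labelled 0 is still Asia
```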

Pretty simple, right? Now, let’s look at another popular way of creating a DataFrame.

From a list (of dicts)

Above, we created a DataFrame from a base unit of Series. Each Series was essentially one column, and the columns were then combined to form a complete DataFrame. Remember that each Series can be best understood as multiple instances of one specific type of data. Above, continent names were one Series, and populations were another.

Here, let’s approach it from another angle – by adding rows together, where each row is a data entry that includes multiple properties.

Intuitively, think about this approach as though we are adding a database entry or an Excel row. The continent example still works: each entry has the properties of name and population. Another example would be a contacts database to which we are adding our friends, where each row is a person with properties such as name, age, phone number, email address and so on.

Practically, these use cases may arise if we are creating the data via a loop. Put simply, the list will consist of dictionaries, and each dictionary will have the structure {‘column name’: value}.

Confused? Don’t worry, that was a lot of information! Let’s take a look at a real example. Here is one that I created earlier:

temp_list = list()
for i in range(5):
    temp_dict = {'title': 'A' * (i+1), 'value': i}
    temp_list.append(temp_dict)
temp_df = pd.DataFrame(temp_list)

At the start of the loop, when i is 0, it creates the dictionary temp_dict as {'title': 'A', 'value': 0} and appends it to the list (temp_list). The loop continues until i is 4, whereupon the dictionary {'title': 'AAAAA', 'value': 4} is appended to the list.

Now, the complete list of dictionaries is passed onto the pd.DataFrame function, to create the resulting DataFrame temp_df.
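For reference, the same list can be built with a list comprehension, as sketched below. The second part shows what I would expect if a dictionary is missing a key: pandas fills the gap with NaN. The sparse_df name is just for this illustration.

```python
import pandas as pd

# Equivalent to the loop: one dictionary per row
temp_list = [{'title': 'A' * (i + 1), 'value': i} for i in range(5)]
temp_df = pd.DataFrame(temp_list)

# The dictionary keys become the column names
print(temp_df)

# A row that lacks a key gets a missing value (NaN) in that column
sparse_df = pd.DataFrame([{'title': 'A', 'value': 0}, {'title': 'AA'}])
print(sparse_df)
```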

Easy, right?

Putting it together

If we were to use this method to recreate the continents_df that we created above, what should our list look like?

Well, the DataFrame included two columns ‘continents’ and ‘population’, so each dictionary will be in the format: {'continents': continent, 'population': continent_pop}.

Does that look familiar? It should, because it’s the same structure that we used above, in using Series to create our DataFrame. The main difference is this – where we used Series as columns, the format was:

{'continents': continents, 'population': continent_pop}

where continents and continent_pop were each a Series.

Now, we will be using:

[{'continents': continent_1, 'population': continent_pop_1},
{'continents': continent_2, 'population': continent_pop_2},
...
{'continents': continent_n, 'population': continent_pop_n}]

where each dictionary value is a single value. Give it a try yourself, and create the continents_df DataFrame with its 5 rows. I will leave it as an exercise.
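If you would like to check your answer afterwards, here is one possible solution. It is only a sketch: pairing the names and populations with zip is my choice, and the names and populations lists are helper variables introduced for this example.

```python
import pandas as pd

names = ['Asia', 'Europe', 'Africa', 'Americas', 'Oceania']
populations = [3811953827, 586098529, 929539692, 898871184, 24549947]

# Build one dictionary per row, each holding a single name and a single number
rows = [{'continents': name, 'population': pop}
        for name, pop in zip(names, populations)]

continents_df = pd.DataFrame(rows)
print(continents_df)
```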


I hope that the above tutorial on how to create a DataFrame or Series from a list or a dictionary was useful to you.

Pandas and DataFrames are such powerful, flexible tools, and I personally find these to be good methods to create a DataFrame or Series to manipulate. Once you start to generate your own DataFrames, you will also start to see when each of these methods comes in handy, and why they are useful tools to have in your Data Science toolbelt.

If you’re looking for examples of what can be done with Pandas, check out this article on Exploring Excel data with Pandas.

Let us know if you have any questions, and see you next time!

JP Hwang

JP is a data visualisation freelancer & writer, with a keen interest in sports analytics. He has a dark past as an engineer and a patent attorney but hopes it won't be held against him too much, at least to his face.
