Working with multiple DataFrames in Pandas is a common thing. Oftentimes, you will want to merge two or more DataFrames. For instance, one DataFrame could contain customer bio-data while the other could contain their transaction history. In such a case you might want to join the two DataFrames to take advantage of information from both of them during your analysis. In such a case the data could be merged on the unique customer. However, there are several ways of merging that dictate which information will be taken from either DataFrame and which will be omitted. In this article, let’s explore that and more.
Pandas three-way joining multiple DataFrames on columns
Pandas allows us to merge more than one DataFrame. The only condition here is that the DataFrames need to have a common column. Let’s illustrate this by creating three DataFrames. They will have a common
First, import Pandas and NumPy. Next, create a dictionary containing the data that will be used to create the DataFrames.
|import pandas as pd|
import numpy as np
Next, let’s create three DataFrames using the data above.
|df1 = pd.DataFrame(names_dict)|
df2 = pd.DataFrame(names_dict2)
df3 = pd.DataFrame(names_dict3)
Let’s now illustrate several ways of merging the above DataFrames
Using the reduce function
reduce function is used to apply a single function to all the elements of a sequence. We can use it to merge all the DataFrames. First, we need to create a list containing all of them.
|dfs = [df1, df2, df3]|
After that, we can use it to apply the Pandas
merge function on the list of DataFrames.
|from functools import reduce|
df_final = reduce(lambda left,right: pd.merge(left,right,on=’Name’), dfs)
merge function is used when joining DataFrames on the index or columns. When columns are used the indices are ignored and vice versa. In this case, we are merging on the `Name` column.
Understanding the merging parameters
By default Pandas, will perform an
Inner join. However, this can be changed by specifying the
how argument. The options available are:
- Left. This uses the keys from the left DataFrame only. It is similar to a SQL left outer join.
- Right. Uses keys from the right DataFrame only. It is similar to a SQL right outer join.
- Outer. Uses the union of keys from both DataFrames. It is similar to a SQL full outer join.
- Inner. Uses the intersection of keys from both DataFrames. It is similar to the SQL inner join.
Let’s now apply the
merge function using all the above options and see the difference.
For instance, applying
outer will result in some null values because Pandas will pick data from all DataFrames.
The inner join doesn’t result in null values because it picks data that matches from all DataFrames.
The left join will pick everything from the left table. Some null values can be observed since
df1 has some null values.
The right join will pick everything from the right table. The null values result from the fact that
df2 doesn’t have
Jeff in the
When merging DataFrames, you, therefore, have to consider the final result to select the most appropriate way to perform the merging.
Let’s take a look at another way of merging the above DataFrames. The
merge function can be called on a Pandas DataFrame. The DataFrame it’s called on acts as the left DataFrame. You can then define the right DataFrame and then specify how you want the merging to take place.
What is the difference between join and merge in Pandas?
You have already seen how to work with the
merge function in Pandas. The
join method is a bit different because it joins the DataFrames on the index or a key column. If the indices are overlapping, you will get this error.
ValueError: Indexes have overlapping values: Index(['Name'], dtype='object')
To join the DataFrames on the
Name column, you first have to set it as the index of the DataFrame.
|dfs = [df.set_index(‘Name’) for df in dfs]|
Alternatively, you can set the index on the individual DataFrames then performing the merging.
|df_1 = df1.set_index(‘Name’)|
df_2 = df2.set_index(‘Name’)
If the columns have columns with similar names, you will get this error;
ValueError: columns overlap but no suffix specified: Index(['Age', 'Phone', 'Uni'], dtype='object')
This can be addressed by passing the
lsuffix(suffix to use from left DataFrame’s overlapping columns) and
rsuffix(suffix to use from right DataFrame’s overlapping columns) arguments. Pandas will then append them to the conflicting columns.
Concatenating Pandas DataFrames
Concatenating Pandas DataFrames is common when you have several DataFrames that contain the same information. In such cases, the second DataFrame can be placed below the first one. This is ideal when both DataFrames have similar and the same number of columns. Otherwise, you will see a lot of null values as seen below.
In this article, you have learned how to perform merging and joining on various DataFrames. The join and merge methods can be used interchangeably. However, the merge method is more straightforward. It is also the most commonly used method.
To reference the code used in this post, please see this Notebook.