Join Merges in Pandas

Working with multiple DataFrames in Pandas is a common thing.  Oftentimes, you will want to merge two or more DataFrames. For instance, one DataFrame could contain customer bio-data while the other could contain their transaction history. In such a case you might want to join the two DataFrames to take advantage of information from both of them during your analysis. In such a case the data could be merged on the unique customer. However, there are several ways of merging that dictate which information will be taken from either DataFrame and which will be omitted. In this article, let’s explore that and more. 

Pandas three-way joining multiple DataFrames on columns

Pandas allows us to merge more than one DataFrame. The only condition here is that the DataFrames need to have a  common column. Let’s illustrate this by creating three DataFrames. They will have a common Name column. 

First, import Pandas and NumPy. Next, create a dictionary containing the data that will be used to create the DataFrames. 

import pandas as pd
import numpy as np
names_dict ={
    ‘Name’:[‘Ken’,’Jeff’,’John’,’Mike’,’Andrew’,’Ann’,’Sylvia’,’Dorothy’,’Emily’,’Loyford’],
    ‘Age’:[31,52,56,12,45,np.nan,78,85,46,135],
    ‘Phone’:[52,79,80,75,43,125,74,44,85,45],
    ‘Uni’:[‘One’,’Two’,’Three’,’One’,’Two’,’Three’,’One’,’Two’,’Three’,’One’]
}
names_dict3 ={
    ‘Name’:[‘Ken’,’Jeff’,’John’,’Mike’,’Andrew’,’Ann’,’Sylvia’,’Dorothy’,’Emily’,’Loyford’],
    ‘Weight’:[23,45,13,45,45,45,43,16,28,56],
    ‘Distance’:[52,79,80,75,43,125,74,44,85,45],
    ‘Skill’:[‘Four’,’Five’,’six’,’Seven’,’Eight’,’Nine’,’Ten’,’Eleven’,’Twelve’,’One’]
}
names_dict2 ={
    ‘Name’:[‘Ken’,’John’,’Mike’,’Andrew’,’Ann’,’Sylvia’,’Dorothy’,’Emily’,’Loyford’],
    ‘Wage’:[24,52,45,14,45,np.nan,12,49,10],
    ‘Height’:[23,43,45,43,34,74,45,56,23],
    ‘Level’:[‘One’,’Three’,’One’,’Two’,’Three’,’One’,’Two’,’Three’,’One’]
}

Next, let’s create three DataFrames using the data above. 

df1 = pd.DataFrame(names_dict)
df2 = pd.DataFrame(names_dict2)
df3 = pd.DataFrame(names_dict3)
dataframe.sample()

Let’s now illustrate several ways of merging the above DataFrames

Using the reduce function 

The reduce function is used to apply a single function to all the elements of a sequence. We can use it to merge all the DataFrames. First, we need to create a list containing all of them. 

dfs = [df1, df2, df3]

After that, we can use it to apply the Pandas merge function on the list of DataFrames. 

from functools import reduce
df_final = reduce(lambda left,right: pd.merge(left,right,on=’Name’), dfs)

The merge function is used when joining DataFrames on the index or columns. When columns are used the indices are ignored and vice versa. In this case, we are merging on the `Name` column. 

Understanding the merging parameters 

By default Pandas, will perform an Inner join. However, this can be changed by specifying the how argument. The options available are:

  • Left. This uses the keys from the left DataFrame only. It is similar to a SQL left outer join. 
  • Right. Uses keys from the right DataFrame only. It is similar to a SQL right outer join. 
  • Outer. Uses the union of keys from both DataFrames. It is similar to a SQL full outer join. 
  • Inner. Uses the intersection of keys from both DataFrames. It is similar to the SQL inner join. 

Let’s now apply the merge function using all the above options and see the difference. 

For instance, applying outer will result in some null values because Pandas will pick data from all DataFrames. 

pd.merge(pd.merge(df1,df2,on=’Name’,how=”outer”),df3,on=’Name’,how=”outer”)
pandas.merge() with an outer join

The inner join doesn’t result in null values because it picks data that matches from all DataFrames. 

pd.merge(pd.merge(df1,df2,on=’Name’,how=”inner”),df3,on=’Name’,how=”inner”)
pandas.merge() with an inner join

The left join will pick everything from the left table. Some null values can be observed since df1 has some null values. 

pd.merge(pd.merge(df1,df2,on=’Name’,how=”left”),df3,on=’Name’,how=”left”)
dataframe.merge() in pandas for python

The right join will pick everything from the right table. The null values result from the fact that df2 doesn’t have Jeff in the Name column. 

pd.merge(pd.merge(df1,df2,on=’Name’,how=”right”),df3,on=’Name’,how=”right”)
dataframe.merge() in pandas for python

When merging DataFrames, you, therefore, have to consider the final result to select the most appropriate way to perform the merging. 

Let’s take a look at another way of merging the above DataFrames. The merge function can be called on a Pandas DataFrame. The DataFrame it’s called on acts as the left DataFrame. You can then define the right DataFrame and then specify how you want the merging to take place. 

df1.merge(df2,on=’Name’).merge(df3,on=’Name’)
dataframe.merge() in pandas for python

What is the difference between join and merge in Pandas?

You have already seen how to work with the merge function in Pandas. The join method is a bit different because it joins the DataFrames on the index or a key column. If the indices are overlapping, you will get this error. 

ValueError: Indexes have overlapping values: Index(['Name'], dtype='object')

To join the DataFrames on the Name column, you first have to set it as the index of the DataFrame. 

dfs = [df.set_index(‘Name’) for df in dfs]

pd.DataFrame().join(dfs,how=”outer”)
dataframe.join() in pandas for python

Alternatively, you can set the index on the individual DataFrames then performing the merging.

df_1 = df1.set_index(‘Name’)
df_2 = df2.set_index(‘Name’)
df_1.join(df_2)
dataframe.join() in pandas for python

If the columns have columns with similar names, you will get this error;

ValueError: columns overlap but no suffix specified: Index(['Age', 'Phone', 'Uni'], dtype='object')

This can be addressed by passing the lsuffix(suffix to use from left DataFrame’s overlapping columns) and rsuffix(suffix to use from right DataFrame’s overlapping columns) arguments. Pandas will then append them to the conflicting columns. 

dataframe.join() in pandas for python

Concatenating Pandas DataFrames

Concatenating Pandas DataFrames is common when you have several DataFrames that contain the same information. In such cases, the second DataFrame can be placed below the first one. This is ideal when both DataFrames have similar and the same number of columns. Otherwise, you will see a lot of null values as seen below. 

pd.concat([df1,df2,df3])
pd.concat() in pandas for python

Final Thoughts

In this article, you have learned how to perform merging and joining on various DataFrames. The join and merge methods can be used interchangeably. However, the merge method is more straightforward. It is also the most commonly used method. 

To reference the code used in this post, please see this Notebook.