Categories: PandasPython

Using Pandas to explore data in Excel files

Introduction:

When it comes to Data Science, we need to talk about data, and data comes in a lot of different shapes, it can be a “.csv”, a “.json”, or even in the proprietary format “.xls” (Used by Microsoft Excel 2003 and earlier versions) and the more recent Excel format, the “.xlsx”. Therefore, this article was designed to serve the reader with the necessary knowledge of the Pandas library to deal with this type of data.

Since a data scientist has to be prepared for different sources of data, it’s important to know some valuable tools available in the “pandas” library. Some of those tools convert a .xls or .xlsx file to a useful “pandas” “data frame” ready to be fed to a Python data pipeline.

Two sets of these tools will be explored along with the text, the first group of tools is formed by the read_excel and to_excel functions from pandas, they offer a more basic interface with the process of reading excel data but can get the job done. After that, to show more advanced “.xlsx” file manipulations, such as appending data to an already existing file, we will use a different class called ExcelWriter.

Let’s get started! However, we need some data, like “Sherlock Holmes” once said “‘ Data! Data! Data!’ he cried impatiently. ‘I can’t make bricks without clay.'”. For instance, we are going to use a dataset that contains ratings of ramen, the Japanese food, it is originable available on the following link: Ramen Ratings – Ramen Eater Website. On the other hand, if the data is no longer available, you can download it from here on my Github.

The dataset contains exactly 3400 ramen reviews, some of them need some data cleaning. So, we’re going to use some Pandas functionalities to clean data. The following columns are included: [Review Number, Brand, Variety, Style, Country, Stars] the dataset it’s also featured on the Kaggle free datasets database: Ramen Ratings – Kaggle dataset.

The first section – Basic Data Science with Excel:

Now we are going to explore the read_excel function, first, we import Pandas as pd and call pd.read_excel(“nameOfTheFile.xlsx”) to import an excel file that is inside the same folder of our program as a Pandas dataframe:

import pandas as pd

dataframe = pd.read_excel("The-Ramen-Rater-The-Big-List-1-3400-Current-As-Of-Jan-25-2020.xlsx")

dataframe.head()

A little look at the data:

Review #BrandVarietyStyleCountryStars
03400EDO PackKumamoto Flavour NoodlesCupHong Kong1
13399Pan MeeGoreng Dried Chili Shrimp FlavourPackMalaysia5
23398PaldoKing Lid Ramen Noodle SoupPackSouth Korea5
33397Nissin MiojoCremoso Carno Com ChiliPackBrazil2
43396Samyang FoodsCham Ramen Big BowlBowlSouth Korea2.25

If you want to open different spreadsheets from an excel file, just use the sheet name parameter. In this way, the Dataframe will be loaded from the specific sheet name that you provided. However, if the name of the parameter is not provided the function will return the first spreadsheet. If you want all spreadsheets of the file, specify None in the sheet name parameter. Example:

dataframe = pd.read_excel("The-Ramen-Rater-The-Big-List-1-3400-Current-As-Of-Jan-25-2020.xlsx", 
                          sheet_name="Reviewed")

We successfully loaded the data from the file to the Dataframe, by calling the “head()” method we get to see some of the data. Now let’s try to alter the Dataframe and save it to a new Excel file.

We are going to order the data by the review number, drop the reversed index and add another feature to it by creating a rating from one to ten from the “Stars” feature. It’s important to check the actual type of each column, to see if any casting will be necessary to work with the data. The “Stars” column contains some invalid data entries, data cleaning will be needed.

dataframe.dtypes
Review #     int64
Brand       object
Variety     object
Style       object
Country     object
Stars       object
dtype: object

f you find some difficulty understanding data types, check our tutorial on Understanding Pandas Data Types.

Dataset with the new Rating feature:

Review #BrandVarietyStyleCountryStarsRating
01WestbraeMiso RamenPackUnited States0.51.0
12Wai WaiTom Yum Chili FlavorPackThailand2.04.0
23Wai WaiTom Yum ShrimpPackThailand2.04.0
34Wai WaiOriental Style Instant NoodlesPackThailand1.02.0
45VifonHu Tiu Nam Vang [“Phnom Penh” style] Asian Sty…BowlVietnam3.57.0

Now that the Dataframe has changed, we will try to save it back to an Excel “xlsx” file by using the Pandas Dataframe to_excel() method.

dataframe_sorted.to_excel("ramen-ratings-clean.xlsx")

It is not mandatory, but we can also set the name of the spreadsheet with the parameter “sheet_name”. Another useful parameter is the index, we can use it to drop the index that Pandas has created before we save the data because we already have the “Review Number” and the default Excel number of the row index.

dataframe_sorted.to_excel("ramen-ratings-clean.xlsx", sheet_name="Ramen-ratings", index=False)

The second section – Advanced interactions with Excel data:

Now, we are going to look at the ExcelWriter class from Pandas and its functionalities. If we want to use the Excelwriter class for “.xlsx” files, the Excel engine needs to be “openpyxl”. The engine can be set during the instantiation of the ExcelWriter class by setting the engine parameter to “openpyxl”.

Openpyxl can be installed with the following pip command:
pip install openpyxl

writer = pd.ExcelWriter('ramem-by-country-analytics.xlsx', engine="openpyxl", index=False)

Using the Pandas ExcelWriter class:

Now we will group the data of dataset by the origin country of each ramen. In this way, the new Excel file will have the main spreadsheet with all of the ramen data, and every country will have its own spreadsheet.

Let’s also create some basic analytics of each country and create a Results spreadsheet with those results. Excel gives a superb way of connecting Data Science results to users.

dataframe_sorted.to_excel(writer, "World Data", index=False)

That is how the ExcelWriter class is used to create a new spreadsheet called world data in the excel file ‘ramen-by-country-analytics.xlsx’, the file is defined on the instantiation of the class.

Now we’re going to use the pandas DataFrame method groupby(), it is useful when you want to divide your Dataframe into groups by the value of some column. In this case, the column selected is “Country”:

Separating reviews by country:

Instantiate necessary variables:
# Since the group by returns a tuple with the name of the country and the country dataset
# let's create some useful variables to access it
tuple_name = 0 # Used to access the name of the tuple, eg. Australia
tuple_dataframe = 1 # Used to access the Dataframe that represents some country, eg. Ramen ratings of Australia

analytics = pd.DataFrame(columns=["Country","Number of Reviews","Average Stars","Average Rating", "Rating Standard Deviation"])
Analysis function:
def generate_analytics(analytics_dataframe, country_name, country_dataframe):
    # Take the length of the DataFrame as the number of reviews     
    number_of_reviews = len(country_dataframe)
    # Calculate the average number of stars given to a country ramen
    average_stars = country_dataframe["Stars"].mean()
    # Calculate the average rating given to a country ramen
    average_rating = country_dataframe["Rating"].mean()
    # Calculate the rating standard deviation of the country    
    rating_std = country_dataframe["Rating"].std()


    # Append the analysis of one country to the Analytics Dataframe     
    analytics_dataframe = analytics_dataframe.append({"Country": country_name,"Number of Reviews": number_of_reviews,
                      "Average Stars": average_stars,"Average Rating": average_rating, 
                      "Rating Standard Deviation": rating_std}, ignore_index=True)

    return analytics_dataframe
Generate analytics by iterating through the data:
for country_ramen_data in dataframe_sorted.groupby("Country"):
#     print(country_ramen_data[tuple_data])
    country_name = country_ramen_data[tuple_name]
    country_dataframe = country_ramen_data[tuple_dataframe]

    # Now every country dataframe is being saved on the Excel file     
    country_dataframe.to_excel(writer, country_name, index=False)
    analytics = generate_analytics(analytics, country_name, country_dataframe)

# Let's substitute the nan values from the standard deviation of countries with only one review
# We will change it to zero on the whole dataframe using the fillna() method
analytics = analytics.fillna(0)

# After every country has been saved on the Excel file as a different spreadsheet, we are going to 
# save the results of the analytics on the file as a spreadsheet called Analytics.
analytics.to_excel(writer, "Analytics", index=False)

Top 10 ramen by the Average Rating of each Country:

analytics.sort_values("Average Rating",ascending=False)[:10]
CountryNumber of ReviewsAverage StarsAverage RatingRating Standard Deviation
38Spain24.2500008.5000000.707107
3Cambodia54.2000008.4000001.516575
11France44.1875008.3750001.108678
21Malaysia1894.1661388.3322751.748058
18Indonesia1524.1118428.2236841.485904
36Singapore1364.1084568.2169121.621273
35Sarawak54.0000008.0000001.274755
23Myanmar143.9464297.8928571.619337
20Japan6033.9160457.8320901.978628
9Fiji43.8750007.7500000.866025

Now to apply all of those changes in the file, we have to call the close() method on the ExcelWriter object.

After this function is called and returns successfully, the Excel file is ready to use, it contains all of the spreadsheets created before.

writer.close()

Wait a minute! A ramen analysis from Zimbabwe has just arrived, what do we do? Maybe we could use the ExcelWriter class again, but we have to be careful with the parameter “mode” if we instantiate the ExcelWriter class normally, the write mode will be set to “write” and it will overwrite all of the current data of the file if we write with it.

The ExcelWriter has to be instantiated in the following manner if we want to append some data without overwriting the whole Excel file:

Appending data using ExcelWriter:

writer = pd.ExcelWriter('ramem-by-country-analytics.xlsx', engine="openpyxl", index=False
                       , mode="a") # a stands for append

analytics = pd.DataFrame(columns=["Country","Number of Reviews","Average Stars","Average Rating", "Rating Standard Deviation"])

old_analytics = pd.read_excel("ramem-by-country-analytics.xlsx", "Analytics")
new_analytics = old_analytics.append({"Country": "Zimbabwe","Number of Reviews": 1,
                      "Average Stars": 4,"Average Rating": 4.5, 
                      "Rating Standard Deviation": 0}, ignore_index=True)
new_analytics.to_excel(writer, "Analytics Updated", index=False)
writer.close()

That’s all Folks!

In this tutorial, we learned how to use the Excel file functionalities of Pandas. To sum it up, it’s important to have those tools in your toolbox because Excel makes the data analysis way easier to the final user of your Data Science project. If you found this article interesting, check how to use the “.csv” file functionalities of Pandas in our tutorial Write a Pandas DataFrame to a CSV File.

Enjoy your brand new ramen leaderboard and thanks for reading!

Learn more about the tools used in this Article by also reading their Pandas documentation:

The code utilized in this post is available here on Github.

Gabriel Fuhr

As a Computer Science student, I'm interested in helping other people to achieve their goals by building efficient software. Also studying Python for Machine Learning and Research. Check my Github: https://github.com/gtfuhr

Recent Posts

Adding rows to a Pandas Dataframe

While studying Data Science, we often come across DataFrames ready to be used. Normally, those…

6 days ago

How to Install & Import Pandas in Python

Pandas is one of the most powerful libraries for data analysis and is the most…

2 weeks ago

Decision Trees in Scikit-Learn

Introduction The decision tree is a machine learning algorithm which perform both classification and regression.…

3 weeks ago

A Holistic Guide to Groupby Statements in Pandas

The Importance of Groupby Functions In Data Analysis Whether working in SQL, R, Python, or…

4 weeks ago

Logistic Regression in Sci-Kit Learn

Introduction Logistic regression is an important model used in supervised learning. You can use logistic…

1 month ago

Pandas-Profiling, explore your data faster in Python

All datasets have one obvious thing in common, information, but this information is easy and…

1 month ago