Categories: Pandas

Pandas-Log and Its Debugging Capabilities

Introduction:

Working with data has a strong connection with programming. Therefore, a Data Scientist who knows the best practices of software development will have advantages while solving Data Science problems. To achieve this, one needs to go beyond the basics while studying Pandas. In this tutorial, we will debug some Pandas code with the help of Pandas-Log.

Debugging errors in the source code and preparing data is a great part of a data scientist’s job. But what if you could solve errors faster? Perhaps, looking “under the hood” of Pandas will make us find the solution faster. And improve the explainability of our operations in the data for free!

For this objective, we will use Pandas-Log, an open-source Pandas logging tool. This tool will help us to see what is happening while we are applying processing steps to data. A Pandas operation can affect rows and columns, Pandas-Log shows us how many columns and rows changed. In the case of any alteration of the Data Frame, it shows what are the remaining rows and columns.

Let’s get a dataset to work with and show a few logging examples. We will use game sales data from Kaggle, Google’s Data Science community, it is available on the following link: Video Game Sales – Kaggle. If the data is no longer available on Kaggle, you can download it from here on my Github.

The data contains features such as name, platform, year, genre, publisher, and the number of sales by global region. Let’s explore the data and do an analysis of it? If we find some difficulty while doing it, we will use the help of Pandas-Log.

Seeing “under the hood” of Queries with Pandas-Log:

In this step, we will see the structure of the dataset:

import pandas as pd
import numpy as np

df = pd.read_csv("vgsales.csv")

df.head()
Games sales data:
RankNamePlatformYearGenrePublisherNA_SalesEU_SalesJP_SalesOther_SalesGlobal_Sales
01Wii SportsWii2006.0SportsNintendo41.4929.023.778.4682.74
12Super Mario Bros.NES1985.0PlatformNintendo29.083.586.810.7740.24
23Mario Kart WiiWii2008.0RacingNintendo15.8512.883.793.3135.82
34Wii Sports ResortWii2009.0SportsNintendo15.7511.013.282.9633.00
45Pokemon Red/Pokemon BlueGB1996.0Role-PlayingNintendo11.278.8910.221.0031.37

How this simple call of the Pandas head() method would look like if we were using Pandas-Log? To use it, just install it via PIP along with its required packages.

pip install --user pandas-log
pip install --user pandas-flavor
pip install --user humanize

After the installation of the pandas-log package and its requirements, we can use it by importing pandas_log:

import pandas_log
with pandas_log.enable():
    df.head()

Pandas-Log Output:

1) head(n: int = 5):
    Metadata:
    * Picked the first 5 rows.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 1.7 kB.

With Pandas-log enabled, the description of the function “head()” is displayed, such as the size of the input Dataframe and the size of the output Dataframe.

After looking at the data, everyone can agree that Nintendo is a strong video game publisher. To see clearer the Nintendo influence in the video game market, we will filter the data to Nintendo games developed before the 2000s. Only Nintendo games in the top 50 will be accepted.

old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nlntendo'")
old_nintendo

Best-selling Nintendo old Games… Wait! Zero games?!
RankNamePlatformYearGenrePublisherNA_SalesEU_SalesJP_SalesOther_SalesGlobal_Sales

The output DataFrame is empty! Nintendo didn’t make any successful videogame before the 2000s? What about Mario and Donkey Kong?

Let’s bring Pandas-Log to the rescue! It will help us find any mistake made in the query.

with pandas_log.enable():
    old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nlntendo'")

Pandas-Log output:

1) query(expr="Year<2000", inplace=False):
    Metadata:
    * Removed 14624 rows (88.10700084347512%), 1974 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 661.8 kB.

2) query(expr="Rank<50", inplace=False):
    Metadata:
    * Removed 1964 rows (99.49341438703141%), 10 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 661.8 kB.
    * Output Dataframe size is 3.3 kB.

3) query(expr="Publisher=='Nlntendo'", inplace=False):
    Metadata:
    * Removed 10 rows (100.0%), 0 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 3.3 kB.
    * Output Dataframe size is 0 Bytes.

Looks like the incorrect step was in the last query, the query that checks if the Publisher of the remaining 10 rows is Nintendo. Wait! Looks like “Nlntendo” contains a typo, it should be “Nintendo”. See how the output Dataframe looks like after the correction:

old_nintendo = df.copy().query("Year<2000").query("Rank<50").query("Publisher=='Nintendo'")
old_nintendo
Best-selling old Nintendo games under the Top 50:
RankNamePlatformYearGenrePublisherNA_SalesEU_SalesJP_SalesOther_SalesGlobal_Sales
12Super Mario Bros.NES1985.0PlatformNintendo29.083.586.810.7740.24
45Pokemon Red/Pokemon BlueGB1996.0Role-PlayingNintendo11.278.8910.221.0031.37
56TetrisGB1989.0PuzzleNintendo23.202.264.220.5830.26
910Duck HuntNES1984.0ShooterNintendo26.930.630.280.4728.31
1213Pokemon Gold/Pokemon SilverGB1999.0Role-PlayingNintendo9.006.187.200.7123.10
1819Super Mario WorldSNES1990.0PlatformNintendo12.783.753.540.5520.61
2122Super Mario LandGB1989.0PlatformNintendo10.832.714.180.4218.14
2223Super Mario Bros. 3NES1988.0PlatformNintendo9.543.443.840.4617.28
3031Pokémon Yellow: Special Pikachu EditionGB1998.0Role-PlayingNintendo5.895.043.120.5914.64
4647Super Mario 64N641996.0PlatformNintendo6.912.851.910.2311.89

Let’s see how another query looks under the hood with Pandas-Log. A query will return a DataFrame containing the top 10 sports games of the Playstation 2 platform. The last operation will sort the games by the number of their US sales:

df.query("Platform=='PS2'").query("Genre=='Sports'").nlargest(10,"NA_Sales")
Best-Selling PS2 sports games in the US:
RankNamePlatformYearGenrePublisherNA_SalesEU_SalesJP_SalesOther_SalesGlobal_Sales
179180Madden NFL 2004PS2NaNSportsElectronic Arts4.260.260.010.715.23
238239Madden NFL 2005PS22004.0SportsElectronic Arts4.180.260.010.084.53
211212Madden NFL 06PS22005.0SportsElectronic Arts3.980.260.010.664.91
240241Madden NFL 07PS22006.0SportsElectronic Arts3.630.240.010.614.49
279280Madden NFL 2003PS22002.0SportsElectronic Arts3.360.210.010.564.14
248249Tony Hawk’s Pro Skater 3PS22001.0SportsActivision2.661.290.010.464.41
451452Madden NFL 2002PS22001.0SportsElectronic Arts2.500.160.010.423.08
306307Tony Hawk’s UndergroundPS22003.0SportsActivision2.291.170.010.423.90
534535NBA StreetPS22001.0SportsElectronic Arts2.190.220.000.382.79
580581ESPN NFL 2K5PS22004.0SportsSega2.150.120.000.362.62
with pandas_log.enable():
    df.query("Platform=='PS2'").query("Genre=='Sports'").nlargest(10,"NA_Sales")

Pandas-Log Output:

1) query(expr="Platform=='PS2'", inplace=False):
    Metadata:
    * Removed 14437 rows (86.98035907940715%), 2161 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 737.8 kB.

2) query(expr="Genre=='Sports'", inplace=False):
    Metadata:
    * Removed 1761 rows (81.49005090236003%), 400 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 737.8 kB.
    * Output Dataframe size is 135.9 kB.

3) nlargest(n=10, columns="NA_Sales", keep='first'):
    Metadata:
    * Picked 10 largest rows by columns (NA_Sales).
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 135.9 kB.
    * Output Dataframe size is 3.3 kB

We can see six Madden NFL games taking place in the US top 10 for PS2 sports games and Pandas-Log describing how the operations took place. With those examples, we can see the expressivity of the Pandas-Log tool and its usefulness while debugging and documenting Data Science Projects.

Utilization example of Pandas-Log in a Data Science Project:

Suppose your client wants his Data Science project error-free and wants to check its inner workings once in a while. In this case, show them how you perform the data queries, by giving your client access to a Verbosity option. A Verbosity variable will determine if the program will show extra output while doing queries:

Verbosity variable use example with Pandas-Log:
import pandas as pd
import numpy as np
import pandas_log

# Set if the program will show the Pandas-Log output while doing a query
# verbosity = False
verbosity = True

dataframe = pd.read_csv("vgsales.csv")

# Execute two important queries asked by the client
# When the client wants to be assured of the results, he would set the verbosity to 1
if(verbosity):
    with pandas_log.enable():
        top_five_take_two_games = dataframe.query("Publisher=='Take-Two Interactive'").nsmallest(5,"Rank")
else:
    top_five_take_two_games = dataframe.query("Publisher=='Take-Two Interactive'").nsmallest(5,"Rank")

top_five_take_two_games

Pandas-Log output:

1) query(expr="Publisher=='Take-Two Interactive'", inplace=False):
    Metadata:
    * Removed 16185 rows (97.5117484034221%), 413 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 140.6 kB.

2) nsmallest(n=5, columns="Rank", keep='first'):
    Metadata:
    * Picked 5 smallest rows by columns (Rank).
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 140.6 kB.
    * Output Dataframe size is 1.7 kB.
Top 5 Best-selling games from Take-Two Interactive:
RankNamePlatformYearGenrePublisherNA_SalesEU_SalesJP_SalesOther_SalesGlobal_Sales
1617Grand Theft Auto VPS32013.0ActionTake-Two Interactive7.019.270.974.1421.40
1718Grand Theft Auto: San AndreasPS22004.0ActionTake-Two Interactive9.430.400.4110.5720.81
2324Grand Theft Auto VX3602013.0ActionTake-Two Interactive9.635.310.061.3816.38
2425Grand Theft Auto: Vice CityPS22002.0ActionTake-Two Interactive8.415.490.471.7816.15
3839Grand Theft Auto IIIPS22001.0ActionTake-Two Interactive6.994.510.301.3013.10
if(verbosity):
    with pandas_log.enable():
        top_five_sony_games = dataframe.query("Publisher=='Sony Computer Entertainment'").nsmallest(5,"Rank")
else:
    top_five_sony_games = dataframe.query("Publisher=='Sony Computer Entertainment'").nsmallest(5,"Rank")

top_five_sony_games

Pandas-Log output:

1) query(expr="Publisher=='Sony Computer Entertainment'", inplace=False):
    Metadata:
    * Removed 15915 rows (95.88504639113147%), 683 rows remaining.
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 5.5 MB.
    * Output Dataframe size is 237.8 kB.

2) nsmallest(n=5, columns="Rank", keep='first'):
    Metadata:
    * Picked 5 smallest rows by columns (Rank).
    Execution Stats:
    * Execution time: Step Took a moment seconds..
    * Input Dataframe size is 237.8 kB.
    * Output Dataframe size is 1.7 kB.
Top 5 Best-selling games from Sony Computer Entertainment:
RankNamePlatformYearGenrePublisherNA_SalesEU_SalesJP_SalesOther_SalesGlobal_Sales
2829Gran Turismo 3: A-SpecPS22001.0RacingSony Computer Entertainment6.855.091.871.1614.98
4748Gran Turismo 4PS22004.0RacingSony Computer Entertainment3.010.011.107.5311.66
5253Gran TurismoPS1997.0RacingSony Computer Entertainment4.023.872.540.5210.95
5455Gran Turismo 5PS32010.0RacingSony Computer Entertainment2.964.880.812.1210.77
6667Final Fantasy VIIPS1997.0Role-PlayingSony Computer Entertainment3.012.473.280.969.72

Pandas-Log, make your Pandas debugging easier:

Those are the use examples of Pandas-Log, with it, you can add those Pandas-Logs to the documentation of your project, so your clients can be sure that the queries that you make are working. Thanks for reading!

Are you still curious about Pandas-Log and the functions used in this tutorial? If yes, check their documentation by accessing their links below:

The code utilized in this post is available here on Github.

Gabriel Fuhr

As a Computer Science student, I'm interested in helping other people to achieve their goals by building efficient software. Also studying Python for Machine Learning and Research. Check my Github: https://github.com/gtfuhr

Recent Posts

Adding rows to a Pandas Dataframe

While studying Data Science, we often come across DataFrames ready to be used. Normally, those…

6 days ago

How to Install & Import Pandas in Python

Pandas is one of the most powerful libraries for data analysis and is the most…

2 weeks ago

Decision Trees in Scikit-Learn

Introduction The decision tree is a machine learning algorithm which perform both classification and regression.…

3 weeks ago

A Holistic Guide to Groupby Statements in Pandas

The Importance of Groupby Functions In Data Analysis Whether working in SQL, R, Python, or…

4 weeks ago

Logistic Regression in Sci-Kit Learn

Introduction Logistic regression is an important model used in supervised learning. You can use logistic…

1 month ago

Pandas-Profiling, explore your data faster in Python

All datasets have one obvious thing in common, information, but this information is easy and…

1 month ago