What is Pandas for Data Analysis?

Source: https://pandas.pydata.org/

Pandas is one of the most popular libraries for data analysis in the world and is growing rapidly. But what exactly is it, and why is it so important to the data science and analytics community? We’ll give you an in-depth history and explanation of the library’s importance to data analysis in Python and how it has taken over the industry since its inception in 2008.

How Pandas Got Started

Wes McKinney, credit: WesMcKinney.com

In 2008, Wes McKinney started the Pandas library as an open-source project. The project is currently a BSD-licensed library. Pandas got its name from the term “Panel Data,” which is used in economics for data sets that track multiple observations over time. The primary goal of the library was to create a more powerful and less programmatically cumbersome approach to analyzing data at scale in Python than existing tools such as R and SAS offered. Wes’ vision for Python’s growth as a language for data analysis was so strong that he dropped out of a PhD program to pursue growing the library.

Key Features of Pandas

When Pandas was first introduced, its benefits to the data analysis community already using Python for scientific and analysis work were obvious. Many of its initial features replicated those of the R programming language, particularly the DataFrame and Series objects. However, because Pandas is ultimately a higher-level abstraction of the NumPy library, whose core routines are implemented in C, the performance within Python for managing large DataFrames compared favorably with what R offered.
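As a rough illustration of the two objects mentioned above and their relationship to NumPy, here is a minimal sketch. The column names and values are made up for the example; they are not from any real data set.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([10.5, 11.0, 9.75], index=["a", "b", "c"], name="price")

# A DataFrame is a table of labeled columns; each column is a Series.
df = pd.DataFrame(
    {"price": [10.5, 11.0, 9.75], "qty": [3, 1, 4]},
    index=["a", "b", "c"],
)

# The values of a column can be retrieved as a NumPy array.
values = df["price"].to_numpy()
print(type(values))

# Arithmetic is applied to whole columns at once, without an explicit loop.
df["total"] = df["price"] * df["qty"]
print(df)
```

The column-wise multiplication on the last lines is the kind of operation that gets handed to NumPy's compiled routines rather than being run row by row in Python.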

The core features of Pandas are the following:

  • Reshaping and pivoting data
  • Reading and writing data from and to common file and database formats
  • Merging and joining data sets
  • Data filtration features and integrated handling of missing data

While there are many other deeper concepts to understand (see below), the main functionality is designed for reading, manipulating, transforming, and analyzing data. Many of its features replicate the basics of the R programming language as well as operations commonly used in SQL databases.
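The core features listed above can be sketched in a few lines. This is only an illustrative example: the CSV content, column names, and the `customers` table are hypothetical, and an in-memory buffer stands in for real files so the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a file path or URL.
orders_csv = io.StringIO(
    "order_id,customer_id,amount\n"
    "1,100,25.0\n"
    "2,101,\n"
    "3,100,40.0\n"
)

# Reading data: the empty amount on order 2 is parsed as a missing value (NaN).
orders = pd.read_csv(orders_csv)

# Handling missing data: replace missing amounts with 0.0.
orders["amount"] = orders["amount"].fillna(0.0)

# Filtering: keep only the orders above a threshold.
large_orders = orders[orders["amount"] > 20]

# Merging: attach a region to each order, similar to a SQL LEFT JOIN.
customers = pd.DataFrame({"customer_id": [100, 101], "region": ["East", "West"]})
merged = orders.merge(customers, on="customer_id", how="left")

# Reshaping: summarize the total amount per region.
summary = merged.pivot_table(index="region", values="amount", aggfunc="sum")

# Writing data: serialize the summary back to CSV (here, to a text buffer).
output = io.StringIO()
summary.to_csv(output)
print(output.getvalue())
```

Filling missing amounts with zero is just one possible choice for this sketch; depending on the analysis, dropping those rows with `dropna` may be more appropriate.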

These features were a step up from the low-level approach to data manipulation that existed in 2008, which involved significant programming to manage data in NumPy arrays, lists, and dictionaries using standard Python and a litany of additional libraries.
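To make the contrast concrete, the sketch below computes the same per-city total twice: once with plain Python dictionaries and once with Pandas. The records are invented for illustration.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical sales records.
records = [
    {"city": "Austin", "sales": 120},
    {"city": "Boston", "sales": 80},
    {"city": "Austin", "sales": 60},
]

# Plain Python: loop over the rows and accumulate totals by hand.
totals = defaultdict(int)
for row in records:
    totals[row["city"]] += row["sales"]
print(dict(totals))

# Pandas: the same aggregation expressed as a single group-and-sum step.
df = pd.DataFrame(records)
totals_pd = df.groupby("city")["sales"].sum()
print(totals_pd)
```

For a single aggregation the difference is small, but the manual version grows quickly once several columns, multiple statistics, or missing values are involved.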

Core Concepts to Understand

Those starting out with Pandas should get familiar with the following practical applications of the library:

  • DataFrames & Series objects
  • Reading & Writing Data
  • Aggregating & Grouping Data
  • Pivoting Tables
  • Time Series Analysis
  • Visualizations in Pandas
  • Merging & Joining data

These are all features of the library that can be used on large and small datasets for analysis. For those familiar with spreadsheets such as Google Sheets or Excel, many of the functions offered by that software are available, as seen above, in Pandas' core concepts. Additional comparisons to SQL also exist, primarily in Pandas' ability to manipulate column-based data (data in DataFrames is arranged in columns and rows, similar to a table in a SQL database).
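The spreadsheet and SQL comparisons can be sketched as follows. The `sales` table and its columns are hypothetical, and the SQL in the comment is only an approximate equivalent of the Pandas call beneath it.

```python
import pandas as pd

# Hypothetical sales data arranged in rows and columns, like a SQL table.
sales = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-17"]
        ),
        "region": ["East", "West", "East", "West"],
        "product": ["A", "A", "B", "A"],
        "revenue": [100, 150, 200, 50],
    }
)

# Roughly: SELECT region, SUM(revenue) FROM sales GROUP BY region
by_region = sales.groupby("region")["revenue"].sum()

# Similar to a spreadsheet pivot table: regions as rows, products as columns.
pivot = sales.pivot_table(
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)

# Time series analysis: total revenue per calendar month.
monthly = sales.set_index("date")["revenue"].resample("MS").sum()

print(by_region)
print(pivot)
print(monthly)
```

Here `"MS"` asks `resample` to bucket rows by month, labeling each bucket with the first day of that month.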

Supporting Libraries

Due to the substantial adoption of the Pandas library, a large number of supporting libraries and open-source projects have emerged that are compatible with Pandas and enhance its performance and functionality. While we don’t cover them all in an exhaustive list here, some of the more popular libraries are listed below.

  • Pandas-Profiling
  • Modin
  • pandas-gbq
  • Statsmodels
  • sklearn-pandas
  • seaborn
  • Plotly
  • Jupyter Notebook
  • Geopandas
  • Pandas-Log

Community Support for Pandas

Pandas is supported by the Chan Zuckerberg Initiative and has many institutional partners who contribute to the maintenance and enhancement of the library. These include Anaconda, Two Sigma, RStudio, and Ursa Labs.

The number of individual contributors to the Pandas library is large and includes several consistent contributors, such as Andy Hayden and Wes McKinney, who have both added to the code base and supported end users through Stack Overflow.

The main sponsor of Pandas is the organization NumFOCUS, a non-profit with a mission to “promote open practices in research, data, and scientific computing by serving as a fiscal sponsor of open source projects and organizing community driven educational programs.”