Pandas is one of the most popular data analysis libraries in the world, and its use is growing rapidly. But what exactly is it, and why is it so important to the data science and analytics community? We’ll give you an in-depth history of the library, explain its importance to data analysis in Python, and show how it has taken over the industry since its inception in 2008.
In 2008, Wes McKinney started the Pandas library as an open-source project. It is currently distributed under a BSD license. Pandas got its name from the term “panel data,” which is used in economics for data sets that include observations over multiple time periods. The primary goal of the library was to create a more powerful and less programmatically cumbersome approach to analyzing data at scale in Python than the solutions then available in tools like R and SAS. McKinney’s vision for Python’s growth as a language for data analysis was so strong that he left a PhD program to focus on growing the library.
When Pandas was first introduced, its benefits to the community already using Python for scientific and analytical work were obvious. Many of its initial features, notably the DataFrame and Series objects, replicated those of the R programming language. However, because Pandas is ultimately a higher-level abstraction over the NumPy library, whose core routines are implemented in C, its performance when managing large DataFrames was strong, and in many cases it was regarded as better than what R offered at the time.
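To make the relationship between DataFrame, Series, and NumPy concrete, here is a minimal sketch. The column names and values are invented for illustration; the point is only that a DataFrame column is a Series, that a Series can hand back its data as a NumPy array, and that arithmetic on columns is applied to whole arrays rather than in a Python loop.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented purely for illustration.
df = pd.DataFrame({"price": [1.5, 2.0, 3.5], "qty": [2, 4, 1]})

# Selecting a single column returns a Series.
price = df["price"]

# A Series can expose its values as a NumPy array.
values = price.to_numpy()

# Column arithmetic is vectorized: no explicit Python loop is needed.
revenue = df["price"] * df["qty"]
```

This is a sketch of the layering, not a benchmark; actual performance depends on the data types and operations involved.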
The core features of Pandas are the following:
While there are many deeper concepts to understand (see below), the main functionality is designed for reading, manipulating, transforming, and analyzing data. Many of its features replicate the basics of the R programming language as well as operations commonly used in SQL databases.
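As a rough illustration of that read, transform, and analyze workflow, the sketch below parses a small CSV, filters it, adds a derived column, and computes a summary. The CSV content and the column names (`city`, `year`, `sales`) are made up for this example, and the text is read from an in-memory string so the snippet runs without any files.

```python
import io
import pandas as pd

# Hypothetical CSV content, invented purely for illustration.
csv_text = """city,year,sales
Austin,2020,100
Austin,2021,150
Boston,2020,200
Boston,2021,180
"""

# Read: parse the CSV text into a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

# Manipulate and transform: keep one year, then add a derived column.
recent = df[df["year"] == 2021].copy()
recent["sales_thousands"] = recent["sales"] / 1000

# Analyze: total sales per city across all years.
totals = df.groupby("city")["sales"].sum()
```

In practice `pd.read_csv` would usually be pointed at a file path or URL rather than a string buffer.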
These features were a step up from the low-level approach to data manipulation that existed in 2008, which required significant programming to manage data in NumPy arrays, lists, and dictionaries using standard Python and a litany of additional libraries.
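The contrast can be sketched with a simple task: averaging a value per group. The records below are hypothetical, and the plain-Python version is only one of several ways such code might have been written by hand; it is shown next to an equivalent Pandas expression.

```python
from collections import defaultdict
import pandas as pd

# Hypothetical records, invented for illustration.
records = [
    {"city": "Austin", "sales": 100},
    {"city": "Boston", "sales": 200},
    {"city": "Austin", "sales": 150},
]

# Plain Python: accumulate sums and counts by hand, then divide.
sums = defaultdict(float)
counts = defaultdict(int)
for row in records:
    sums[row["city"]] += row["sales"]
    counts[row["city"]] += 1
manual_means = {city: sums[city] / counts[city] for city in sums}

# Pandas: the same grouping and averaging in one expression.
pandas_means = pd.DataFrame(records).groupby("city")["sales"].mean().to_dict()
```

Both approaches produce the same per-city averages; the Pandas version simply moves the bookkeeping into the library.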
Those starting out with Pandas should get familiar with the following practical applications of the library:
These are all features of the library that can be used for analysis on datasets large and small. For those familiar with spreadsheets such as Google Sheets or Excel, many of the functions offered by that software are available in Pandas’ core concepts, as the list above suggests. Comparisons to SQL also apply, primarily in Pandas’ ability to manipulate tabular data (data in a DataFrame is arranged in rows and columns, similar to a table in a SQL database).
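To show what the SQL comparison looks like in practice, here is a small sketch that mirrors a `WHERE` filter, a `JOIN`, and a `GROUP BY` using Pandas. The two tables (`orders` and `customers`) and their columns are hypothetical, and the SQL in the comments is only an approximate equivalent.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [10, 11, 10, 12],
    "amount": [25.0, 40.0, 15.0, 60.0],
})
customers = pd.DataFrame({
    "customer_id": [10, 11, 12],
    "region": ["East", "West", "East"],
})

# Roughly: SELECT * FROM orders WHERE amount > 20
large_orders = orders[orders["amount"] > 20]

# Roughly: SELECT * FROM orders JOIN customers USING (customer_id)
joined = orders.merge(customers, on="customer_id", how="inner")

# Roughly: SELECT region, SUM(amount) FROM ... GROUP BY region
by_region = joined.groupby("region")["amount"].sum()
```

Spreadsheet users may recognize the last step as the kind of summary a pivot table would produce.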
Due to the substantial adoption of the Pandas library, there has been a large increase in supporting libraries that are compatible with Pandas and enhance its performance, as well as in open-source projects that extend the library. While we don’t cover them all in an exhaustive list here, some of the more popular libraries are listed below.
Pandas is supported by the Chan Zuckerberg Initiative and has many institutional partners who contribute to the maintenance and enhancement of the library. These include Anaconda, Two Sigma, RStudio, and Ursa Labs.
The number of individual contributors to the Pandas library is large and includes several consistent contributors, such as Andy Hayden and Wes McKinney, who have both added to the code base and supported end users through Stack Overflow.
The main sponsor of Pandas is NumFOCUS, a non-profit organization with a mission to “promote open practices in research, data, and scientific computing by serving as a fiscal sponsor of open source projects and organizing community driven educational programs.”