Pandas: An Open Source Library for Python

A Brief Introduction

Pandas is an open source library built on top of NumPy. It is designed for fast data analysis, preparation, and cleaning, with an emphasis on performance and productivity. It also has built-in visualization features and can read data from a wide variety of sources.

Installation

You’ll need to install pandas from your command line or terminal using:

pip install pandas

Alternatively, you may install Anaconda on your PC, which comes with Pandas pre-installed.

Getting Started

First, you need to import the NumPy and pandas libraries:

import numpy as np
import pandas as pd

Here, np is the conventional alias for NumPy and pd is the conventional alias for Pandas.

Using Pandas

There are two common ways to get data into pandas:

  1. By converting a Python list, dictionary, or NumPy array to a Pandas Series or DataFrame.
  2. By reading a local file using Pandas, usually a CSV, TSV, or Excel file.
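The first route can be sketched as follows. This is a minimal illustration, and the variable names and sample values are made up for the example:

```python
import numpy as np
import pandas as pd

# From a Python list: pandas assigns a default integer index (0, 1, 2).
s_from_list = pd.Series([10, 20, 30])

# From a dictionary: the keys become the index labels.
s_from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a NumPy array, with explicit labels supplied through `index`.
s_from_array = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])

# From a dictionary of equal-length lists: each key becomes a column.
df_from_dict = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})

print(s_from_dict)
print(df_from_dict)
```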

We can read a file into pandas using a command of the following form:

pd.read_filetype()

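Here, filetype is a placeholder rather than a real function name. In an interactive environment such as Jupyter, type pd.read_ and press Tab to autocomplete and choose the appropriate reader, for example read_csv, read_excel, or read_json.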
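As a concrete sketch of one of these readers, the example below uses pd.read_csv. The CSV content is hypothetical, and io.StringIO stands in for a file on disk so that the snippet runs without any external file:

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a real file path.
csv_text = "Country,Value1\nUSA,10\nIndia,20\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# For a tab-separated (TSV) file, pass sep="\t" to the same function.
tsv_text = "Country\tValue1\nUSA\t10\nIndia\t20\n"
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_csv)
```

With a real file, the path (for example a string such as "data.csv", a made-up name here) is passed in place of the StringIO object.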

Data Structures in Pandas

Pandas provides two main data structures.

  1. Series: A one-dimensional labeled array with homogeneous data, meaning the whole Series has a single dtype. Its values can be changed in place, while operations that change its length generally return a new Series.
  2. DataFrame: A two-dimensional table with heterogeneous data, meaning each column can have its own dtype. For example, it can hold and handle SQL table data.
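The difference between homogeneous and heterogeneous data shows up in the dtype attributes. The following is a small sketch with made-up values:

```python
import pandas as pd

# A Series carries a single dtype for all of its values.
s = pd.Series([1.5, 2.5, 3.5])

# A DataFrame carries one dtype per column.
frame = pd.DataFrame({"name": ["a", "b"],
                      "score": [0.5, 0.7],
                      "count": [1, 2]})

print(s.dtype)        # dtype of the whole Series
print(frame.dtypes)   # one entry per column
```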

Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. A Series is comparable to a single column in an Excel sheet. Let’s see how it is declared:

labels = ['a', 'b', 'c']    # index labels
my_data = [10, 20, 30]      # values
pd.Series(data=my_data, index=labels)

The output that we receive is:

a    10
b    20
c    30
dtype: int64

The above is an example of a Series with a labeled index.
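The labels are what make lookups convenient. Here is a short sketch of label-based access on the same Series (the variable name ser is chosen for this example):

```python
import pandas as pd

labels = ["a", "b", "c"]
my_data = [10, 20, 30]
ser = pd.Series(data=my_data, index=labels)

print(ser["b"])          # look up a single value by its label
print(ser[["a", "c"]])   # a list of labels returns a smaller Series
print(ser.index)         # the labels themselves
```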

Let’s take another example, say:

ser1=pd.Series([1,2,3,4],['USA','India','China','Japan'])
ser2=pd.Series([1,2,5,4],['USA','India','Italy','Japan'])

A basic operation on Series is addition:

ser1+ser2
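
The output is:

China    NaN
India    4.0
Italy    NaN
Japan    8.0
USA      2.0
dtype: float64

Here, we see that values are added where the labels match in both Series, and labels present in only one Series produce NaN (null). Because NaN is a floating-point value, the result has dtype float64.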
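If the nulls are not wanted, one option is the Series.add method with fill_value, which treats a label missing from one side as the given value. The sketch below compares the two approaches (the names total and filled are chosen for this example):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ["USA", "India", "China", "Japan"])
ser2 = pd.Series([1, 2, 5, 4], ["USA", "India", "Italy", "Japan"])

# Plain addition: labels found in only one Series become NaN.
total = ser1 + ser2

# add() with fill_value=0: a label missing on one side counts as 0.
filled = ser1.add(ser2, fill_value=0)

print(total)
print(filled)
```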

Dataframes

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). In other words, data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.

For example, let’s create a DataFrame:

data = {'Country':['USA', 'India', 'China', 'Japan'],
        'Value1':[10, 20, 30, 40],
        'Value2':[1, 2, 3, 4],
        'Capital':['Washington DC','New Delhi','Beijing','Tokyo']}
df = pd.DataFrame(data)
df

So, what we have here is a set of columns that all share the same row index. Each of the columns is a pandas Series.
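This structure can be checked directly. The sketch below rebuilds the same DataFrame and inspects its shape, its index, and one of its columns:

```python
import pandas as pd

data = {"Country": ["USA", "India", "China", "Japan"],
        "Value1": [10, 20, 30, 40],
        "Value2": [1, 2, 3, 4],
        "Capital": ["Washington DC", "New Delhi", "Beijing", "Tokyo"]}
df = pd.DataFrame(data)

print(df.shape)      # (number of rows, number of columns)
print(df.index)      # default row labels 0..3
print(df.columns)    # column labels

# A single column is a Series that shares the DataFrame's row index.
value1 = df["Value1"]
print(type(value1).__name__)
print(value1.index.equals(df.index))
```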

Indexing and Selection in DataFrames

A column of a DataFrame can be selected as in the example below:

df['Country']

The result is a Series that is part of the DataFrame ‘df’.
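Selecting several columns at once works the same way with a list of names, and in that case the result is a DataFrame rather than a Series. A minimal sketch, using a shortened version of the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "India", "China", "Japan"],
                   "Value1": [10, 20, 30, 40],
                   "Capital": ["Washington DC", "New Delhi", "Beijing", "Tokyo"]})

one_col = df["Country"]                  # single name -> Series
two_cols = df[["Country", "Capital"]]    # list of names -> DataFrame

print(type(one_col).__name__)
print(type(two_cols).__name__)
```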

Pandas supports adding a new column by assigning to it as if it already exists in the DataFrame.

df['new']=df['Value1']+df['Value2']

To drop the new column permanently, we use the following command:

df.drop(['new'],axis=1,inplace=True)

If you want to delete a row, say the row labeled 1, use the following command:

df.drop([1],axis=0,inplace=True)

This and other similar operations to delete rows and columns are described in the pandas documentation for DataFrame.drop.

Note that, by default, axis=0 and inplace=False, so drop returns a new DataFrame and leaves the original unchanged unless inplace=True is passed.
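The effect of these defaults can be seen in a short sketch. The names without_row and without_col are chosen for this example, and the DataFrame is a shortened version of the one above:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "India", "China", "Japan"],
                   "Value1": [10, 20, 30, 40],
                   "Value2": [1, 2, 3, 4]})

# Defaults (axis=0, inplace=False): drops the row labeled 1, returns a copy.
without_row = df.drop([1])

# axis=1 targets columns instead of rows; still returns a copy.
without_col = df.drop(["Value2"], axis=1)

print(df.shape)            # the original keeps all rows and columns
print(without_row.shape)
print(without_col.shape)
```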

To select a row by its index label, we can use:

df.loc[0]

The result is a Series holding the values of the specified row, labeled by the column names.
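Since loc works with labels, it helps to contrast it with iloc, which works with integer positions. The two agree on a fresh DataFrame but differ once rows have been dropped. A minimal sketch, using a shortened version of the DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "India", "China", "Japan"],
                   "Capital": ["Washington DC", "New Delhi", "Beijing", "Tokyo"]})

row = df.loc[0]                  # row with index label 0, returned as a Series
capital = df.loc[0, "Capital"]   # a single cell: row label, then column name

# After dropping label 0, position 0 holds the row labeled 1.
shorter = df.drop([0])
first_by_position = shorter.iloc[0]

print(row)
print(capital)
print(first_by_position["Country"])
```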

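Consider the following code, which assumes df still has all four rows (re-create df first if you ran the in-place row drop above, since assigning four values to a three-row DataFrame raises an error):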

newind='red green yellow blue'.split()
df['color']=newind
df.set_index('color')

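This is how we set a new index on a DataFrame. Note that set_index returns a new DataFrame and leaves df unchanged; assign the result back (df = df.set_index('color')) to keep it.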
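The sketch below shows the index change end to end, including a lookup by the new labels. The name indexed is chosen for this example:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "India", "China", "Japan"],
                   "Value1": [10, 20, 30, 40]})
df["color"] = "red green yellow blue".split()

# set_index returns a new DataFrame; the 'color' column becomes the index.
indexed = df.set_index("color")

print(df.index)               # the original still has the default index
print(indexed.loc["green"])   # rows can now be selected by color
```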

EDA Operations

df.head() displays the first 5 records of the DataFrame by default. Similarly, df.tail() displays the last 5 records.

To check the data type of the columns we can use the following command:

df.dtypes
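These inspection commands can be tried together in one short sketch. It also uses describe(), which is not covered above and is included here as one further option for a quick numeric summary:

```python
import pandas as pd

df = pd.DataFrame({"Country": ["USA", "India", "China", "Japan"],
                   "Value1": [10, 20, 30, 40],
                   "Value2": [1, 2, 3, 4]})

first_two = df.head(2)    # head and tail accept the number of rows to show
last_two = df.tail(2)
summary = df.describe()   # count, mean, std, min, quartiles, max (numeric columns)

print(first_two)
print(last_two)
print(df.dtypes)
print(summary)
```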

Conclusion

There are many more operations that we can perform using pandas; please refer to the Python Pandas Documentation for details. I hope you enjoyed reading this article and are now clear on the basic concepts of Pandas. Thank You and Have a Good Day!

For more on how the Pandas library is growing over time and why, see our post here.