
Pandas: An Open Source Library for Python

A Brief Introduction

Pandas is an open source library built on top of NumPy. It supports fast data analysis as well as data preparation and cleaning, and it is designed for both performance and productivity. It also has built-in visualization features and can work with data from a wide variety of sources.

Installation

You’ll need to install pandas from your command line or terminal using:

pip install pandas

Alternatively, you can install Anaconda on your PC, which comes with Pandas pre-installed.

Getting Started

First, you need to import the NumPy and Pandas libraries:

import numpy as np
import pandas as pd

Here, np and pd are the conventional aliases for NumPy and Pandas.

Using Pandas

There are two common ways to get data into pandas:

  1. By converting a Python list, dictionary or NumPy array to a Pandas Series or DataFrame.
  2. By opening a local file using Pandas, usually a CSV, TSV or Excel file.
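
As a minimal sketch of the first approach, the snippet below converts a list, a dictionary and a NumPy array. The variable names and sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain Python list becomes a Series with a default integer index (0, 1, 2).
s_from_list = pd.Series([10, 20, 30])

# A dictionary becomes a Series whose index is the dictionary keys.
s_from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# A 2-D NumPy array becomes a DataFrame; the column names are supplied here.
arr = np.array([[1, 2], [3, 4], [5, 6]])
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])

print(s_from_list)
print(s_from_dict)
print(df_from_array)
```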

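We can open a file with pandas using one of its read functions, which follow this naming pattern:

pd.read_filetype()

Here, filetype is a placeholder rather than a real function name. Replace it with the format you are reading, for example pd.read_csv() or pd.read_excel(). In an interactive environment with autocompletion, typing pd.read_ and pressing Tab lists the available readers.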
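
To make this concrete, here is a small sketch using pd.read_csv(). The CSV and TSV contents are invented sample data, and io.StringIO stands in for a file on disk so the example runs without one. With a real file you would pass its path instead.

```python
import io
import pandas as pd

# Hypothetical CSV content; io.StringIO lets read_csv treat a string as a file.
csv_text = "Country,Value1\nUSA,10\nIndia,20\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# Tab-separated data can be read by the same function with sep='\t'.
tsv_text = "Country\tValue1\nUSA\t10\nIndia\t20\n"
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_csv)
print(df_tsv)
```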

Data Structures in Pandas

Pandas provides two main data structures:

  1. Series: A one-dimensional array with homogeneous data. The values of a Series can be altered, but not its size.
  2. DataFrame: A two-dimensional structure with heterogeneous data, where each column can have its own type. For example, it can hold and handle SQL table data.
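
The difference between homogeneous and heterogeneous data shows up in the dtypes. The sketch below uses made-up values: the Series reports a single dtype, while the DataFrame reports one dtype per column.

```python
import pandas as pd

# A Series holds values of one dtype.
s = pd.Series([1.5, 2.5, 3.5])
print(s.dtype)

# A DataFrame can mix dtypes, one per column.
frame = pd.DataFrame({'name': ['a', 'b'],
                      'score': [1.5, 2.5],
                      'count': [1, 2]})
print(frame.dtypes)
```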

Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.). The axis labels are collectively called the index. You can think of a Series as a single column in an Excel sheet. Let’s see how it is declared:

labels=['a','b','c']
my_data=[10,20,30]
pd.Series(data=my_data,index=labels)

The output is a Series whose values 10, 20 and 30 are indexed by the labels a, b and c rather than by the default integer positions. This is an example of a labeled index Series.
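
With a labeled index, a value can be looked up either by its label or by its integer position. A short sketch using the same data (the name ser is chosen here only for illustration):

```python
import pandas as pd

labels = ['a', 'b', 'c']
my_data = [10, 20, 30]
ser = pd.Series(data=my_data, index=labels)

print(ser)

# Values can be looked up by label or by integer position.
print(ser['b'])      # by label
print(ser.iloc[1])   # by position
```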

Let’s take another example, say:

ser1=pd.Series([1,2,3,4],['USA','India','China','Japan'])
ser2=pd.Series([1,2,5,4],['USA','India','Italy','Japan'])

A basic operation on Series is addition:

ser1+ser2

Here, pandas aligns the two Series by label before adding. Values whose labels appear in both Series (USA, India and Japan) are added, while labels that appear in only one (China and Italy) get NaN, the null value.
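
The sketch below runs the addition and also shows one possible way to avoid the nulls: the Series.add() method with fill_value=0, which treats a missing label as 0. Whether filling with 0 is appropriate depends on your data.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'India', 'China', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], ['USA', 'India', 'Italy', 'Japan'])

# Labels found in only one Series produce NaN.
total = ser1 + ser2
print(total)

# Series.add with fill_value=0 treats a missing label as 0 instead.
filled = ser1.add(ser2, fill_value=0)
print(filled)
```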

Dataframes

A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). It consists of three principal components: the data, the rows, and the columns.

For example, let’s create a DataFrame:

data = {'Country':['USA', 'India', 'China', 'Japan'],
        'Value1':[10, 20, 30, 40],
        'Value2':[1, 2, 3, 4],
        'Capital':['Washington DC','New Delhi','Beijing','Tokyo']}
df = pd.DataFrame(data)
df

What we have here is a set of columns that all share the same row index. Each of the columns is a pandas Series.
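
These components can be inspected directly. The following sketch rebuilds the same DataFrame and checks its shape, its column and row labels, and that a single column really is a Series sharing the DataFrame's index.

```python
import pandas as pd

data = {'Country': ['USA', 'India', 'China', 'Japan'],
        'Value1': [10, 20, 30, 40],
        'Value2': [1, 2, 3, 4],
        'Capital': ['Washington DC', 'New Delhi', 'Beijing', 'Tokyo']}
df = pd.DataFrame(data)

print(df.shape)            # (number of rows, number of columns)
print(list(df.columns))    # column labels
print(list(df.index))      # row labels (default integer index)

# Each column is a Series that shares the DataFrame's row index.
print(type(df['Value1']))
print(df['Value1'].index.equals(df.index))
```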

Indexing and Selection in DataFrames

A column of a DataFrame can be selected by name, as in the example below:

df['Country']

The output is a Series that is part of the DataFrame ‘df’.
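
Selecting several columns works the same way, except that a list of names is passed and the result is a DataFrame rather than a Series. A small sketch with a shortened version of the data:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'India'],
                   'Value1': [10, 20],
                   'Capital': ['Washington DC', 'New Delhi']})

# A single name returns a Series.
one_col = df['Country']

# A list of names returns a DataFrame with just those columns.
two_cols = df[['Country', 'Capital']]

print(type(one_col))
print(two_cols)
```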

Pandas supports adding a new column by assigning to it as though it already existed in the DataFrame.

df['new']=df['Value1']+df['Value2']
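
The new column is computed row by row from the existing ones. A runnable sketch with only the two numeric columns:

```python
import pandas as pd

df = pd.DataFrame({'Value1': [10, 20, 30, 40],
                   'Value2': [1, 2, 3, 4]})

# Assigning to a column name that does not exist yet creates the column.
df['new'] = df['Value1'] + df['Value2']

print(df)
```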

To drop the new column permanently, we pass axis=1 (columns) and inplace=True, which modifies df itself instead of returning a copy:

df.drop(['new'],axis=1,inplace=True)

If you want to delete a row, say the one labeled 1, use the following command:

df.drop([1],axis=0,inplace=True)

This and other similar operations to delete rows and columns can be seen here.

Note that, by default, axis=0 and inplace=False, so drop() removes rows and returns a new DataFrame, leaving the original unchanged.
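
The effect of the default inplace=False can be seen in the sketch below, where the original DataFrame stays intact and the result is captured in a new variable. It uses the columns= and index= keywords of drop(), which are an alternative to passing axis. The sample data is shortened for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'India', 'China'],
                   'Value1': [10, 20, 30]})

# Without inplace=True, drop() returns a new DataFrame.
without_col = df.drop(columns=['Value1'])   # same as df.drop(['Value1'], axis=1)
without_row = df.drop(index=[1])            # same as df.drop([1], axis=0)

print(df)            # unchanged
print(without_col)
print(without_row)
```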

To select a row we can use:

df.loc[0]

The result is a Series holding the values of the specified row, indexed by the column names.
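
Note that loc selects by label, while its counterpart iloc selects by integer position. The two only look the same when the index is the default 0, 1, 2 and so on. The sketch below uses a made-up string index to make the difference visible.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'India', 'China'],
                   'Value1': [10, 20, 30]},
                  index=['x', 'y', 'z'])

row_by_label = df.loc['y']       # row whose index label is 'y'
row_by_pos = df.iloc[1]          # second row, regardless of its label
cell = df.loc['y', 'Value1']     # a single value: row label, column name

print(row_by_label)
print(cell)
```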

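Consider the following code, which assumes df still has its original four rows (after the row drop above, assigning four colors would raise an error):

newind='red green yellow blue'.split()
df['color']=newind
df=df.set_index('color')

This is how we set a new index on a DataFrame. Because set_index() returns a new DataFrame by default, the result is assigned back to df to keep it.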
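
Here is a self-contained sketch of the same idea. It keeps the indexed result in a separate variable (indexed, a name chosen for illustration) to show that set_index() does not change the original, and uses reset_index() to turn the index back into a regular column.

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'India', 'China', 'Japan'],
                   'Value1': [10, 20, 30, 40]})
df['color'] = 'red green yellow blue'.split()

# set_index returns a new DataFrame; df keeps its integer index.
indexed = df.set_index('color')
print(indexed)

# Rows can now be selected by color label.
print(indexed.loc['green'])

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(restored)
```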

EDA Operations

df.head() displays the first 5 records of the DataFrame. Similarly, df.tail() displays the last 5 records.

To check the data type of each column we can use the following attribute:

df.dtypes
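
Both head() and tail() accept the number of rows to show, and describe() gives summary statistics for the numeric columns. A short sketch on made-up data with ten rows:

```python
import pandas as pd

df = pd.DataFrame({'Value1': range(10),
                   'Value2': [x * 1.5 for x in range(10)]})

first3 = df.head(3)      # first 3 rows instead of the default 5
last2 = df.tail(2)       # last 2 rows
summary = df.describe()  # count, mean, std, min, quartiles, max per column

print(first3)
print(last2)
print(df.dtypes)
print(summary)
```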

Conclusion

There are many more operations that we can perform using pandas. Please refer to the Python Pandas Documentation. I hope you enjoyed reading this article and are now clear on the basic concepts of Pandas. Thank you and have a good day!

For more on how the Pandas library is growing over time and why, see our post here.

Surya Remanan

Budding Data Scientist and a Student. Loves to blog about Data Science.
