Pandas is a must-have Python library in every data scientist’s toolkit. It is essential for preparing and manipulating data ahead of analysis and machine learning. Pandas can also aggregate data, for example through grouping operations, and it supports quick visualizations. DataFrames and Series are the two core data structures of the package, so understanding them is the first step to mastering Pandas. In this article, you will learn how to use DataFrames and Series in Pandas.
Ensure that you have Pandas installed. You can install it via pip (`pip install pandas`) or Conda (`conda install pandas`). Install it and let’s get this plane in the air.
A Pandas Series is a one-dimensional labeled array that can hold different data types such as Python objects, integers, and floats. If you are coming from the Excel world, you can think of it as one column in an Excel sheet.
Let’s now take a moment and look at how you can create a series and perform some operations on it.
Creating a Series
A series is created by providing some data and, optionally, an index. The data can be a list or even a dictionary. The index is also array-like. You can create the series by calling `pd.Series` and passing it the data and the index.
Here is what the above series looks like.
You can confirm the type of the series using Python’s built-in `type` function.
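As a minimal sketch of the creation step described above: the values 10, 20, and 30 are illustrative assumptions, while the name `my_series` and the index labels follow the ones used later in this article. The second call shows the dictionary form, where the keys become the index.

```python
import pandas as pd

# Illustrative values (assumed); the labels match those used later in the article.
my_series = pd.Series(
    [10, 20, 30],
    index=["index_one", "index_two", "index_three"],
)

# A Series can also be built from a dictionary; the keys become the index.
same_series = pd.Series({"index_one": 10, "index_two": 20, "index_three": 30})

# Printing shows the labels on the left and the values on the right.
print(my_series)
```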
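A short sketch of that check, reusing the assumed example values from before. `isinstance` is included as a common alternative when you want a yes/no answer rather than the class itself.

```python
import pandas as pd

# Same illustrative series as before (values are assumed).
my_series = pd.Series(
    [10, 20, 30],
    index=["index_one", "index_two", "index_three"],
)

# type() reports the class of the object, which is the pandas Series class.
print(type(my_series))

# isinstance() answers the same question with a boolean.
is_series = isinstance(my_series, pd.Series)
print(is_series)
```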
Accessing an Element of Series
Let’s now take a look at how you can access items in a series. For instance, you can use the `index` attribute to view the labels of the series.
You can then use these labels to access items in the series. Let’s, for example, access the item labeled `index_one`.
The same can be achieved by passing the position of the item you would like to select. Square brackets with an integer position work in older Pandas versions, but selecting by position on a labeled series is deprecated in recent versions, so `iloc` is the safer choice.
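The three access patterns described above can be sketched as follows. The series values are assumed for illustration, and `iloc` is used for positional access because it behaves the same way across Pandas versions.

```python
import pandas as pd

# Illustrative series (values are assumed).
my_series = pd.Series(
    [10, 20, 30],
    index=["index_one", "index_two", "index_three"],
)

# The index attribute holds the labels of the series.
print(my_series.index)

# Access by label using square brackets.
by_label = my_series["index_one"]

# Access by position using iloc; position 0 is the first item.
by_position = my_series.iloc[0]

print(by_label, by_position)
```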
Operations on Pandas Series
You can also perform various operations on a series. For instance, you can obtain the maximum and minimum values in a series with the `max` and `min` methods. Let’s take a look at how you can do that on the series you created above.
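A minimal sketch of these two operations, again on the assumed example values:

```python
import pandas as pd

# Illustrative series (values are assumed).
my_series = pd.Series(
    [10, 20, 30],
    index=["index_one", "index_two", "index_three"],
)

largest = my_series.max()   # largest value in the series
smallest = my_series.min()  # smallest value in the series

print(largest, smallest)
```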
Before looking at further operations, let’s introduce a second series.
    series2 = pd.Series([1, 2, 5, 4],
                        index=['index_one', 'index_two', 'index_three', 'index_four'])
Let’s take a look at the two Pandas series before performing some operations on them. Notice that `series2` has one extra label compared to the first one.
Some of the operations you can perform on the two series include:
- Adding them
- Subtracting them
- Obtaining the remainder (modulo)
- Multiplying them
- Dividing them
Here’s what the subtraction operation looks like.
    my_series - series2
Notice that the result contains a null value (`NaN`) for `index_four`, because the first series has no item with that label.
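To make the alignment behavior concrete, here is a small sketch. The values of `my_series` are assumed for illustration, while `series2` matches the one defined above. Pandas matches items by label before computing, so a label present in only one series yields `NaN`. The `sub` method with `fill_value` is shown as one possible way to treat the missing item as zero instead.

```python
import pandas as pd

# my_series values are assumed; series2 is the series defined in the article.
my_series = pd.Series(
    [10, 20, 30],
    index=["index_one", "index_two", "index_three"],
)
series2 = pd.Series(
    [1, 2, 5, 4],
    index=["index_one", "index_two", "index_three", "index_four"],
)

# Items are matched by label; index_four exists only in series2, so it becomes NaN.
difference = my_series - series2
print(difference)

# Optional: treat a missing item as 0 before subtracting.
difference_filled = my_series.sub(series2, fill_value=0)
print(difference_filled)
```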
The other operations can be done in a similar manner.
    my_series + series2
    my_series / series2
    my_series % series2
    my_series * series2
With that Series information out of the way, let’s now take a look at Pandas DataFrames. A Pandas DataFrame is essentially several series that share an index and have been brought together as columns. A DataFrame is, therefore, a two-dimensional representation of data. It contains rows and columns, just like an Excel spreadsheet.
Creating a DataFrame
A Pandas DataFrame can be created by passing the following information to the `pd.DataFrame` function:
- the data. This can be a dictionary or a series
- the index to use for the DataFrame
- the column labels for the DataFrame
Let’s take a look at this in action.
Start by creating the data in a dictionary. The data can also contain null values.
    import numpy as np
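The dictionary itself might look like the sketch below. Every value here is hypothetical, and the column name `Name` is an assumption; the `Age`, `Phone`, and `Uni` columns are the ones this article refers to later. One age is set to `np.nan` to represent a missing value.

```python
import numpy as np

# Hypothetical records for illustration only; each key becomes a column.
names_dict = {
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],  # np.nan marks a missing age
    "Phone": [52, 79, 80, 75, 43, 39],
    "Uni": ["MIT", "UCLA", "MIT", "UCLA", "Oxford", "Oxford"],
}

print(list(names_dict.keys()))
```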
The next step is to use the `DataFrame` function to create the Pandas DataFrame.
    df = pd.DataFrame(names_dict)
Next, you can use the `head` function to view the first five records in the DataFrame.
The following step is usually to look at the summary of the DataFrame by using the `info` function.
This function will show you the number of entries in the DataFrame, the data type of each column, and how many non-null values each column holds, which tells you whether there are any null values.
You can also use the `describe` function to show some descriptive statistics about the dataset.
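The three inspection steps above can be sketched together as follows. The DataFrame contents are hypothetical stand-ins for the article’s data.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],
    "Phone": [52, 79, 80, 75, 43, 39],
    "Uni": ["MIT", "UCLA", "MIT", "UCLA", "Oxford", "Oxford"],
})

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)

# info() prints the row count, column types, and non-null counts.
df.info()

# describe() summarizes the numeric columns (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)
```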
Indexing and Selecting Data
There are a couple of ways to select data in Pandas. Let’s start by looking at integer-location-based indexing, which is done with `iloc`.
If you check the type of the result, you will notice that it’s a Pandas series.
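A minimal sketch of integer-location-based indexing, using hypothetical data. Selecting a single row by position returns a Series whose labels are the column names, while a slice of positions returns a DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],
    "Phone": [52, 79, 80, 75, 43, 39],
    "Uni": ["MIT", "UCLA", "MIT", "UCLA", "Oxford", "Oxford"],
})

# A single integer position selects one row and returns a Series.
first_row = df.iloc[0]
print(type(first_row))

# A slice of positions returns a DataFrame (rows 0 and 1 here).
first_two_rows = df.iloc[0:2]
print(first_two_rows)
```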
You can also select the data by using the column names.
Passing the columns as a list returns the result as a Pandas DataFrame and not a Series.
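The difference between the two forms of column selection can be sketched like this, with hypothetical data and an assumed `Name` column:

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],
    "Phone": [52, 79, 80, 75, 43, 39],
    "Uni": ["MIT", "UCLA", "MIT", "UCLA", "Oxford", "Oxford"],
})

# A single column name returns a Series.
ages = df["Age"]

# A list of column names returns a DataFrame, even if the list has one item.
name_and_age = df[["Name", "Age"]]
age_only_frame = df[["Age"]]

print(type(ages), type(name_and_age))
```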
Working with missing data
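Let’s now take a look at how you can identify and deal with null values. You can use the Pandas `isnull` function on the DataFrame to identify columns that have null values.
In this case, you can see that the Age column contains null values. Let’s fill the null values with the mean age. Specifying `inplace=True` asks Pandas to make the change in the original object rather than return a new one. Note that in recent Pandas versions, calling `fillna` with `inplace=True` on a single selected column (for example `df['Age']`) may not update the original DataFrame, so assigning the result back to the column is the safer choice. If you don’t want to change the original DataFrame, leave this option out and store the result in a new variable instead.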
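One way to sketch this check, using hypothetical data in which one age is missing. `isnull` returns a DataFrame of booleans, and chaining `sum` counts the null values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only; one age is missing.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],
    "Phone": [52, 79, 80, 75, 43, 39],
    "Uni": ["MIT", "UCLA", "MIT", "UCLA", "Oxford", "Oxford"],
})

# True marks a null value; summing the booleans counts nulls per column.
null_counts = df.isnull().sum()
print(null_counts)

# Names of the columns that contain at least one null value.
columns_with_nulls = list(null_counts[null_counts > 0].index)
print(columns_with_nulls)
```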
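A sketch of the fill step on hypothetical data. This version assigns the filled column back to the DataFrame instead of using `inplace=True`, which is a deliberate substitution: assignment behaves consistently across Pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only; one age is missing.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],
})

# mean() skips null values, so this is the mean of the five known ages.
mean_age = df["Age"].mean()

# Replace the nulls with the mean and assign the result back to the column.
df["Age"] = df["Age"].fillna(mean_age)

print(df)
```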
In some cases, you might want to remove some columns that you don’t intend to use. This can be achieved using the `drop` function and specifying the name of the column and its axis. If you intend to make this change on the original dataset, you will have to pass the `inplace=True` argument.
    df.drop('Age', axis=1)  # axis=1 refers to columns, axis=0 to rows
Creating new columns
When doing your analysis, you may need to perform some computation and create a new column with that result. In Pandas, a new column can be created by assigning to it as if it already existed. If a column with that name doesn’t exist, Pandas will create it; if it does, its values are overwritten.
Let’s take a look at how this can be done by multiplying two columns in the DataFrame.
    df['times'] = df['Age'] * df['Phone']
Filtering a DataFrame
In the process of your analysis, you may want to filter the DataFrame based on a certain condition. For instance, let’s filter this DataFrame to give us the results where the age is greater than 20.
    df[df['Age'] > 20]
You can also instruct Pandas to return the data where the name matches a given name of a person.
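Filtering by name could look like the sketch below. The column name `Name` and the value `Ken` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, 36, 18, 22],
})

# The comparison builds a boolean mask; indexing with it keeps the matching rows.
ken_rows = df[df["Name"] == "Ken"]

print(ken_rows)
```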
You also have the ability to filter on more than one condition. For instance, you can filter on the Age and Phone columns.
    df[(df['Age'] > 20) & (df['Phone'] > 40)]
Grouping data
Grouping is a very common operation in data analysis. The `groupby` function in Pandas provides that functionality. For instance, you can group by the `Uni` column and compute the mean age for each group. While at it, you can also sort the values in descending order. Calling `reset_index` at the end turns the group labels back into a regular column and returns the result as a Pandas DataFrame.
    group = df.groupby(['Uni'])['Age'].mean().sort_values(ascending=False).reset_index()
Exporting your analysis
Once you are done with your analysis, you can export the DataFrame as a JSON, Excel, or CSV file. Let’s take a look at how you can export the DataFrame obtained above.
    group.to_csv('names_with_index.csv')  # index=True by default
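For the other options, here is a hedged sketch. The `group` DataFrame is a hypothetical stand-in for the grouped result, and the files are written to a temporary folder so the example does not clutter your working directory. Passing `index=False` to `to_csv` leaves the row index out of the file. Excel export is shown only as a comment because `to_excel` needs an Excel engine such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the grouped result.
group = pd.DataFrame({"Uni": ["MIT", "UCLA", "Oxford"], "Age": [43.5, 52.0, 20.0]})

# Write into a temporary folder (an assumption made for this sketch).
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, "names.csv")
json_path = os.path.join(out_dir, "data.json")

group.to_csv(csv_path, index=False)  # index=False leaves the row index out
group.to_json(json_path)             # JSON export

# Excel export requires an Excel engine such as openpyxl:
# group.to_excel(os.path.join(out_dir, "names.xlsx"))

print(os.listdir(out_dir))
```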
Reading files with Pandas
Pandas provides the functionality to read all the above file types. For example, let’s read in the Excel file exported above.
    data = pd.read_excel("names.xlsx")
The other file formats can be loaded in a similar manner.
    names = pd.read_csv("names.csv")
    json = pd.read_json("data.json")
Create a Spreadsheet-like pivot table as a DataFrame
Pandas also supports the creation of Excel-like pivot tables. This is done using the `pivot_table` function, which also lets you apply an aggregation function such as the mean to the data. For example, a pivot table can return the mean age for each value in the `Uni` column.
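A minimal sketch of such a pivot table on hypothetical data. The `index` argument chooses the column whose values become the rows, `values` chooses the column to aggregate, and `aggfunc` names the aggregation. Missing ages are skipped when the mean is computed.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Name": ["Ken", "Jeff", "John", "Mike", "Andrew", "Ann"],
    "Age": [31, 52, 56, np.nan, 18, 22],
    "Uni": ["MIT", "UCLA", "MIT", "UCLA", "Oxford", "Oxford"],
})

# One row per university, holding the mean of the Age column.
pivot = pd.pivot_table(df, index="Uni", values="Age", aggfunc="mean")

print(pivot)
```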
In this article, you have seen how to create and manipulate Pandas Series and DataFrames. More specifically, you have covered, among other things:
- How to create a Pandas Series
- How to create a Pandas DataFrame
- Selecting data in a DataFrame
- Grouping data in a DataFrame
- Operations on DataFrames
The code used in this article can be found here.