Categories: Pandas, Python

Tips for Performing EDA With Python

What is Exploratory Data Analysis (EDA)?

EDA with Python is a critical skill for data analysts, data scientists, and even data engineers. EDA, or Exploratory Data Analysis, is the practice of analyzing a dataset to understand its main characteristics using visual and statistical methods.

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John W. Tukey

John Tukey defined the main process that statisticians, and now data analysts, use to explore data and form hypotheses about a given dataset. The specific steps have varied since Tukey came up with this process in 1961. However, many of the basics have not changed, including the following:

  • Loading and Understanding Data Definitions in Your Data
  • Checking the Contents of the Data For Issues
  • Assessing the Data Types
  • Extracting Summary Statistics
  • Generation of Data Visualizations:
    • Boxplots
    • Scatter Plots
  • Creating Advanced Statistics
    • Correlation Matrices

This post stops short of the advanced EDA diagnostics used to check regression, classification, and clustering analyses, which are normally reserved for specific business problems. We'll cover those topics in more detailed posts about each area of analysis.

As for when and where EDA occurs in the traditional analytics and data science life-cycle, the diagram below shows that EDA is one of the first and most critical steps before any algorithm is put into production. Within CRISP-DM (Cross-Industry Standard Process for Data Mining), we perform EDA in the Data Understanding phase of our initial analysis.

CRISP-DM diagram

In this post, we'll touch on some of the Business Understanding steps at a very high level, but we will mainly focus on the second phase of the CRISP-DM framework, Data Understanding.

Python Libraries For EDA

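We'll use four libraries throughout this post:

  • NumPy (np) provides fast array operations and numerical routines.
  • Seaborn (sns) builds on Matplotlib and offers higher-level statistical plots.
  • Matplotlib (plt) is the base plotting library.
  • pandas (pd) provides the DataFrame structure we use to load, clean, and summarize the data.

The imports below load each library under its conventional alias. The %matplotlib inline line is a Jupyter/IPython magic command, so it only works inside a notebook: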

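import numpy as np               # numerical arrays and math helpers
import seaborn as sns            # statistical plots built on Matplotlib
import matplotlib.pyplot as plt  # base plotting library
import pandas as pd              # DataFrames for loading and summarizing data
# Jupyter/IPython magic: render plots inline (remove when running as a plain script)
%matplotlib inline
# Apply Seaborn's default theme and remap the shorthand color codes ('b', 'g', 'r', ...)
sns.set(color_codes=True)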

For more on the libraries that build on pandas, see the Pandas Ecosystem section of our post What is Pandas for Data Analysis?

Load the Sample Data

We'll be using an open-source dataset of home sale statistics hosted by FSU. It contains data on fifty home sales, with columns for selling price, asking price, living space, rooms, bedrooms, bathrooms, age, acreage, and taxes. The file also has an initial header line, which we replace with clean column names in the loading step below:

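# Location of the CSV file (fifty home sales)
file_name = "https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv"
# Read the file; the first line is used as the header row
df = pd.read_csv(file_name)
# Replace the header names from the file with clean, consistent labels
df.columns = ['Sell', 'List', 'Living', 'Rooms', 'Beds', 'Baths',
              'Age', 'Acres', 'Taxes']
# Preview the first five rows
df.head()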
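The columns are defined as follows:

  • Sell: selling price
  • List: asking price
  • Living: living space
  • Rooms: number of rooms
  • Beds: number of bedrooms
  • Baths: number of bathrooms
  • Age: age of the home
  • Acres: acreage
  • Taxes: taxes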
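To see why the column names are overwritten after loading, here is a minimal sketch. It assumes the header line contains quoted names with a space after each comma; the inline CSV text and its values are made up for illustration and are not taken from the real file. It shows the rename approach used in this post next to an alternative, the skipinitialspace option of pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical sample: quoted header names with a space after each comma.
# The numbers are made up for illustration only.
raw = '"Sell", "List", "Living"\n100, 110, 20\n125, 130, 25\n'

# Option 1: read as-is, then overwrite the column labels (the approach used in this post)
df_renamed = pd.read_csv(io.StringIO(raw))
df_renamed.columns = ["Sell", "List", "Living"]

# Option 2: skip the space after each delimiter so the quoted names parse cleanly
df_clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

print(list(df_renamed.columns))
print(list(df_clean.columns))
```

Overwriting df.columns is the more explicit of the two, but it depends on listing the names in exactly the same order as the file.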

Check Column & Row Contents

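df.dtypes                  # data type of each column
df.count()                 # non-null values per column, before de-duplication
df = df.drop_duplicates()  # drop rows that are exact duplicates
df.count()                 # non-null values per column, after de-duplication
df.isnull().sum()          # number of missing values in each column
df.isnull().values.any()   # True if any value in the DataFrame is missing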
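The checks above can be tried on a small frame where the answers are easy to verify by eye. This is only a sketch: the toy DataFrame and its values are made up for illustration and are not rows from the homes dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not values from the homes dataset.
toy = pd.DataFrame({
    "Sell": [142, 175, 175, 129, np.nan],
    "Beds": [5, 4, 4, 3, 3],
})

rows_before = len(toy)
deduped = toy.drop_duplicates()  # the second and third rows are identical, so one is dropped
duplicates_removed = rows_before - len(deduped)

missing_per_column = deduped.isnull().sum()         # count of NaN values in each column
has_missing = bool(deduped.isnull().values.any())   # True if any cell is NaN

print(f"duplicate rows removed: {duplicates_removed}")
print(missing_per_column)
print(f"any missing values: {has_missing}")
```

Comparing the row count before and after drop_duplicates() is a quick way to see whether any duplicates were present at all.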

Column Data Type Assessment

One additional step we should take as part of evaluating the data is to check whether the data types loaded from the original dataset match our descriptive understanding of the underlying data.

For example, every column in our dataset is read in as an integer or float. We should change at least two of the int64 variables, Beds and Rooms, into categorical data types, because those columns take a limited, fixed number of possible values.

The relevant pandas tools are DataFrame.dtypes and Series.astype('category'):
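df['Beds'] = df['Beds'].astype('category')    # bedrooms: small, fixed set of values
df['Rooms'] = df['Rooms'].astype('category')  # rooms: small, fixed set of values
df.dtypes                                     # confirm the new category dtypes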
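As a rough illustration of what the conversion changes, the sketch below converts a small, made-up Beds column (not values from the homes dataset) and inspects the resulting categories through the .cat accessor.

```python
import pandas as pd

# Made-up bedroom counts for illustration only.
toy = pd.DataFrame({"Beds": [3, 4, 3, 5, 4, 3]})

toy["Beds"] = toy["Beds"].astype("category")

categories = list(toy["Beds"].cat.categories)    # the distinct values the column can take
counts = toy["Beds"].value_counts().sort_index() # number of rows in each category

print(toy["Beds"].dtype)
print(categories)
print(counts)
```

A categorical column is no longer treated as numeric, so numeric-only operations skip it unless told otherwise.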

Summary Statistics

df.describe()  # count, mean, std, min, quartiles, and max for each numeric column
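One thing to keep in mind after converting Beds and Rooms to categories: on a DataFrame with mixed types, describe() summarizes only the numeric columns by default. The sketch below, using made-up values, shows how the include argument can be used to summarize the categorical columns separately.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Sell": [100, 150, 200, 250],
    "Beds": pd.Series([3, 4, 3, 5]).astype("category"),
})

numeric_summary = toy.describe()                        # numeric columns only by default
category_summary = toy.describe(include=["category"])   # count, unique, top, freq

print(numeric_summary)
print(category_summary)
```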

Boxplots

  • Boxplots with pandas
  • Identifying outliers: the process and its meaning
    • The range of the data gives us a measure of spread and is equal to the difference between the largest data point (max) and the smallest one (min).
    • The interquartile range (IQR) is the range covered by the middle 50% of the data.
    • IQR = Q3 – Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data. The third quartile (Q3) is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.
    • The IQR can be used to detect outliers using the 1.5 × IQR criterion. Outliers are observations that fall below Q1 – 1.5 × IQR or above Q3 + 1.5 × IQR.
  • Showing outliers
  • Boxplots by categorical variable
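The 1.5 × IQR rule described above can be sketched in a few lines of pandas. The helper name iqr_outliers and the toy values are hypothetical and only for illustration. Note that Series.quantile interpolates linearly between data points, so its quartiles can differ slightly from the "median of each half" description.

```python
import pandas as pd


def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return the observations that fall outside the 1.5 * IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    return values[(values < lower_fence) | (values > upper_fence)]


# Made-up values for illustration only; not rows from the homes dataset.
toy = pd.DataFrame({
    "Sell": [100, 110, 120, 130, 140, 150, 160, 400],
    "Beds": [3, 3, 3, 4, 4, 4, 5, 5],
})

outliers = iqr_outliers(toy["Sell"])
print(outliers)

# Boxplot of Sell grouped by the Beds column
ax = toy.boxplot(column="Sell", by="Beds")
```

Matplotlib draws boxplot whiskers at 1.5 × IQR by default, so the points plotted beyond the whiskers correspond to this same rule, applied within each group.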

Histogram Generation

Scatter Plots

  • Matrix Plot with Pandas/SNS
  • Individual plots with
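As a minimal sketch of both options with pandas alone, the example below draws a scatter matrix and then a single scatter plot. The toy values are made up for illustration, and the choice of columns is an assumption rather than the plots originally planned for this section.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up values for illustration only; not rows from the homes dataset.
toy = pd.DataFrame({
    "Sell": [100, 120, 140, 160, 180],
    "List": [105, 126, 150, 165, 190],
    "Taxes": [2000, 2300, 2900, 3100, 3600],
})

# Matrix plot: one scatter plot per pair of columns, histograms on the diagonal
axes = scatter_matrix(toy, figsize=(6, 6), diagonal="hist")

# Individual plot: a single pair of columns
ax = toy.plot.scatter(x="List", y="Sell")
```

The matrix view is handy for a first scan, while individual plots are easier to read once a pair of columns looks interesting.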

Pair Plots

Correlation Matrix

A correlation matrix produced by DataFrame.corr() and styled by Seaborn.
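A minimal sketch of how such a matrix could be produced is shown below, using made-up values rather than the homes dataset. It assumes pandas 1.5 or later for the numeric_only argument, which leaves out non-numeric columns such as the categorical Beds and Rooms. The color map and value limits are illustrative choices, not necessarily the settings used for the figure in this post.

```python
import pandas as pd
import seaborn as sns

# Made-up values for illustration only; not rows from the homes dataset.
toy = pd.DataFrame({
    "Sell": [100, 120, 140, 160, 180],
    "List": [105, 126, 150, 165, 190],
    "Age": [50, 42, 30, 21, 10],
})

# Pairwise Pearson correlations between the numeric columns
corr = toy.corr(numeric_only=True)

# Heatmap with each coefficient printed in its cell; colors span -1 to 1
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")

print(corr.round(2))
```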

Data Analysis Summary

  • Taxes and Sell price appear to be strongly positively correlated
  • Acres is not strongly correlated with either Taxes or Sell price
  • Age appears to be negatively correlated with Taxes, Sell price, and List price
  • List and Sell prices are positively correlated

Summary

Andrew W. Owens

Analytics and sciences contributor and professional. Specializing in Python and GCP.
