EDA with Python is a critical skill for all data analysts, data scientists, and even data engineers. EDA, or Exploratory Data Analysis, is the process of analyzing a dataset to understand its main statistical characteristics using visual and statistical methods.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” – John W. Tukey
John Tukey defined the core process by which statisticians, and now data analysts, explore data to form hypotheses about a given dataset. The specific steps have varied since Tukey introduced this process in 1961, but many of the basics have not changed.
One place we stop short in this post is the advanced diagnostic side of EDA used to check regression, classification, and clustering analyses – operations normally reserved for specific business problem solutions. We'll cover those topics in more detailed posts on each of those areas of analysis.
As for when and where EDA occurs in the traditional analytics and data science life-cycle, it is one of the first and most critical steps, coming before any productionization of an algorithm. Within CRISP-DM (Cross Industry Standard Process for Data Mining), we perform EDA in the Data Understanding phase of our initial analysis review.
In this post, we'll touch at a very high level on some of the Business Understanding steps, but our main focus is the second step of the CRISP-DM framework: Data Understanding.
We'll rely on four libraries throughout: NumPy for numerical arrays and math, pandas for loading and manipulating tabular data, Matplotlib for basic plotting, and seaborn for statistical visualizations built on top of Matplotlib. We import each of them below:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(color_codes=True)
For more background on pandas and its broader ecosystem, see our post What is Pandas.
We'll be using an open-source dataset from FSU on home sale statistics. It contains data on fifty home sales: selling price, asking price, living space, rooms, bedrooms, bathrooms, age, acreage, and taxes. The file also has an initial header line, which we will replace in our data-loading steps below:
file_name = "https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv"
df = pd.read_csv(file_name)
df.columns = ['Sell', 'List', 'Living', 'Rooms', 'Beds', 'Baths', 'Age', 'Acres', 'Taxes']
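With the columns renamed, a quick first look at the frame usually comes next. A minimal sketch, using a small hypothetical stand-in DataFrame with the same column names (in the post itself, df comes from pd.read_csv above and the values shown here are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in rows with the same renamed columns as the FSU homes data
df = pd.DataFrame({
    'Sell': [142, 175, 129], 'List': [160, 180, 132],
    'Living': [28, 18, 13], 'Rooms': [10, 8, 6],
    'Beds': [5, 4, 3], 'Baths': [3, 1, 1],
    'Age': [60, 12, 41], 'Acres': [0.28, 0.43, 0.33],
    'Taxes': [3167, 4033, 1471],
})

print(df.shape)       # (number of rows, number of columns)
print(df.head())      # first few rows of the frame
print(df.describe())  # summary statistics for numeric columns
```

These three calls (shape, head, describe) are often the first pass of Tukey-style exploration: how big is the data, what does it look like, and what are its basic distributions.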
df = df.drop_duplicates()
df.count()
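To see what these two calls actually do, here is a minimal sketch on a toy frame (hypothetical values, not the homes data): drop_duplicates removes exact repeated rows, and count reports non-null values per column, which also surfaces missing data.

```python
import pandas as pd

# Toy frame: row 0 and row 1 are exact duplicates; Taxes has one missing value
df = pd.DataFrame({
    'Sell': [142, 142, 175],
    'Taxes': [3167, 3167, None],
})

df = df.drop_duplicates()  # removes the repeated row, leaving two rows
print(df.count())          # per-column count of non-null values
```

After the drop, Sell counts 2 values while Taxes counts only 1, so a gap between column counts is a quick flag for missing data.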
One additional step we should take as part of evaluating the data is to check whether the datatypes loaded from the original dataset match our descriptive understanding of the underlying data.
For example, every column in our dataset is read in as an integer or float. We should convert at least two of the int64 variables – Beds and Rooms – to categorical datatypes, because each takes on only a limited, fixed set of possible values.
df['Beds'] = df['Beds'].astype('category')
df['Rooms'] = df['Rooms'].astype('category')
df.dtypes
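A minimal sketch of what this conversion changes, on a small hypothetical frame: after astype('category'), dtypes reports the columns as category, and value_counts summarizes the fixed set of levels each variable can take.

```python
import pandas as pd

# Hypothetical Beds/Rooms values standing in for the homes data
df = pd.DataFrame({'Rooms': [10, 8, 6, 8], 'Beds': [5, 4, 3, 4]})

df['Beds'] = df['Beds'].astype('category')
df['Rooms'] = df['Rooms'].astype('category')

print(df.dtypes)                  # Beds and Rooms now show as category
print(df['Beds'].value_counts())  # observations per category level
```

Treating these columns as categorical also tells downstream tools (groupbys, seaborn plots) to handle them as discrete levels rather than continuous numbers.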