EDA with Python is a critical skill for data analysts, data scientists, and even data engineers. EDA, or Exploratory Data Analysis, is the act of analyzing a dataset to understand its main statistical characteristics using visual and statistical methods.

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

John W. Tukey

John Tukey defined the main process that statisticians, and now data analysts, use to explore data and form hypotheses about a given dataset. The recommended steps have varied since Tukey came up with this process in 1961. However, many of the basics have not changed, including the following:

- Loading and Understanding Data Definitions in Your Data
- Checking the Contents of the Data For Issues
- Assessing the Data Types
- Extracting Summary Statistics
- Generation of Data Visualizations:
  - Boxplots
  - Scatter Plots
- Creating Advanced Statistics:
  - Correlation Matrices

This post stops short of the advanced EDA diagnostics used to check regression, classification, and clustering analyses, which are normally reserved for specific business problems. We’ll cover those topics in more detailed posts on each of those areas of analysis.

As for when and where EDA occurs in the traditional analytics and data science life cycle, we can see below that EDA is one of the first and most critical steps before we move any algorithm into production. Within the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework, we perform EDA in the *Data Understanding* phase of our initial analysis review.

In this post, we’ll touch on some of the Business Understanding steps at a very high level, but we will mainly focus on the second step of the CRISP-DM framework, Data Understanding.

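We’ll use four libraries: NumPy for numerical arrays, pandas for loading and manipulating tabular data, Matplotlib for plotting, and seaborn for statistical visualizations built on top of Matplotlib. The imports below use their conventional aliases: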

```
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
```

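```
# Jupyter/IPython magic: render plots inline in the notebook
# (omit this line when running as a plain Python script)
%matplotlib inline
# Apply seaborn's default theme and remap shorthand color codes ("b", "g", "r", ...)
sns.set(color_codes=True)
```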

For more on the pandas ecosystem, see the What is Pandas post.

We’ll be using an open-source dataset from FSU on home sale statistics. It contains data on fifty home sales, with columns for selling price, asking price, living space, rooms, bedrooms, bathrooms, age, acreage, and taxes. There is also an initial header line, which we will replace with cleaner column names in the data loading steps below:

```
file_name = "https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv"
df = pd.read_csv(file_name)
```

```
df.columns = ['Sell', 'List', 'Living', 'Rooms', 'Beds', 'Baths',
              'Age', 'Acres', 'Taxes']
```

`df.head()`

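- Data column definitions: `Sell` is the selling price, `List` the asking price, `Living` the living space, `Rooms` the number of rooms, `Beds` the number of bedrooms, `Baths` the number of bathrooms, `Age` the age of the home, `Acres` the acreage, and `Taxes` the taxes.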

- Reference to columns content post

`df.dtypes`

`df.count()`

```
df = df.drop_duplicates()
df.count()
```

`df.isnull().sum()`

`df.isnull().values.any()`

One additional step in evaluating the data is to check whether the data types pandas inferred on load match our descriptive understanding of the underlying data.

For example, every column in our dataset is read in as an integer or float. We should change at least two of the int64 variables, `Beds` and `Rooms`, into categorical data types, because those variables take a limited, fixed set of possible values.

```
df['Beds'] = df['Beds'].astype('category')
df['Rooms'] = df['Rooms'].astype('category')
df.dtypes
```

`df.describe()`

- Pandas
- Identify outliers:
- Explain the process and meaning

- The range of the data gives us a measure of spread and is equal to the difference between the largest data point (max) and the smallest one (min).
- The interquartile range (IQR) is the range covered by the middle 50% of the data.
- IQR = Q3 - Q1, the difference between the third and first quartiles. The first quartile (Q1) is the value such that one quarter (25%) of the data points fall below it, or the median of the bottom half of the data. The third quartile (Q3) is the value such that three quarters (75%) of the data points fall below it, or the median of the top half of the data.
- The IQR can be used to detect outliers using the 1.5 × IQR criterion. Outliers are observations that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
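
The 1.5 × IQR rule described above can be sketched in a few lines of pandas. This is a minimal illustration, not the post's own method: the `prices` values are made up rather than taken from homes.csv, and the names `lower_fence` and `upper_fence` are ours. Note that `Series.quantile` interpolates linearly by default, so its quartiles can differ slightly from the "median of each half" definition.

```python
import pandas as pd

# Hypothetical selling prices (made-up values, not taken from homes.csv).
prices = pd.Series([142, 175, 129, 138, 232, 135, 150, 207, 271, 89, 567])

# First and third quartiles (pandas interpolates linearly by default).
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Fences for the 1.5 x IQR criterion.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Keep only the observations that fall outside the fences.
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(outliers)
```

With the real data, the same filter could be applied to a column such as `df['Sell']`.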

- Show outliers

- BoxPlot by Categorical variables
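
One possible way to draw a boxplot split by a categorical variable is seaborn's `boxplot` with the category on the x-axis. The sketch below assumes we want selling price grouped by bedroom count; the rows are made-up values rather than the real homes.csv data.

```python
import pandas as pd
import seaborn as sns

# Hypothetical rows (made-up values, not taken from homes.csv).
homes = pd.DataFrame({
    "Sell": [142, 175, 129, 138, 232, 135, 150, 207],
    "Beds": [4, 4, 3, 3, 5, 3, 4, 5],
})

# One box per bedroom count, showing the spread of selling prices.
ax = sns.boxplot(data=homes, x="Beds", y="Sell")
ax.set_title("Selling price by number of bedrooms")

# The same comparison as numbers: median selling price per bedroom count.
medians = homes.groupby("Beds")["Sell"].median()
print(medians)
```

In a notebook with `%matplotlib inline` the figure renders automatically; in a plain script you would call `plt.show()` from `matplotlib.pyplot` afterwards.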

- Matrix Plot with Pandas/SNS
- Individual plots with

Pair Plots


- Taxes and Sell price appear to be highly positively correlated
- Acres does not have a strong correlation with either Taxes or Sell price
- Age appears to be negatively correlated with Taxes, Sell price, and List price
- List and Sell prices are positively correlated
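
Observations like these are typically read off a correlation matrix. The sketch below shows one way to compute and plot it, assuming made-up rows rather than the real homes.csv values, so the numbers it prints will not match the findings above. Selecting numeric columns first is our choice: it keeps columns converted to `category` out of the calculation.

```python
import pandas as pd
import seaborn as sns

# Hypothetical rows (made-up values, not taken from homes.csv).
homes = pd.DataFrame({
    "Sell": [142, 175, 129, 138, 232, 135],
    "List": [160, 180, 132, 140, 240, 140],
    "Age": [42, 62, 40, 54, 42, 56],
    "Taxes": [3167, 4033, 1471, 3204, 3613, 1316],
})

# Pairwise Pearson correlations between the numeric columns only.
corr = homes.select_dtypes(include="number").corr()
print(corr.round(2))

# Heatmap of the matrix; fixing the color scale to [-1, 1] keeps
# positive and negative correlations visually comparable.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.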

- https://kite.com/blog/python/data-analysis-visualization-python/
- https://towardsdatascience.com/hitchhikers-guide-to-exploratory-data-analysis-6e8d896d3f7e
- https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
- https://infovis-wiki.net/wiki/Exploratory_Data_Analysis_(EDA)
- https://genomicsclass.github.io/book/pages/exploratory_data_analysis.html
- https://towardsdatascience.com/supervised-machine-learning-workflow-from-eda-to-api-f6a7719ad897
- https://www.kaggle.com/ekami66/detailed-exploratory-data-analysis-with-python
- https://medium.com/@swastiknayak76/data-science-life-cycle-and-exploratory-data-analysis-with-python-f8005febe131
