
Classification Model Evaluation Metrics in Scikit-Learn


Classification

One of the two major types of predictive modeling in supervised machine learning is classification; the other is regression, which was discussed in an earlier article. Classification involves predicting the class (of the target variable) to which a particular sample from a population belongs, where the target variable takes discrete categorical values rather than continuous real numbers. A couple of examples of classification problems include:

  • Disease Detection: Classifying blood test results to predict whether a patient has diabetes or not (2 target variable classes). This is an example of binary classification.
  • Image Classification: Handwriting recognition of letters (26 classes) and digits (10 classes). This is an example of multi-class classification.

Model Evaluation

A classification model’s performance can only be as good as the metric used to evaluate it. If an incorrect evaluation metric is used to select and tune the classification model’s parameters, be it logistic regression or random forest, the model’s real-world application will largely be in vain.

One of the critical aspects to consider when choosing a classification model’s evaluation metric is that simple accuracy (i.e. calculating whether each prediction was correct or incorrect) is generally not an appropriate metric, especially when the training dataset is imbalanced. An imbalanced dataset is one where the number of training samples for each class label is not balanced, i.e. the class distribution is not equal or close to equal. This can happen for two reasons: either the real-world occurrence of each class is itself imbalanced, or the training data is inherently biased or skewed.

For example, if a classification model is intended to predict fraudulent transactions from a dataset where 90% of the samples are not fraud and 10% are fraud, then a naive classifier that always predicts "not fraud", regardless of input, will be 90% accurate on average. In other words, in a dataset of 100 samples where 10 are actually fraudulent, if the model were to predict that all 100 were not fraudulent, the accuracy metric would still report 90%, which is misleading, to say the least, about the model’s performance.
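To illustrate, below is a minimal sketch (using hypothetical fraud labels, not our diabetes data) of how a majority-class "classifier" reaches 90% accuracy while catching no fraud at all:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 90 legitimate (0) and 10 fraudulent (1) transactions
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant to a majority-class predictor

naive = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred_naive = naive.predict(X)

print(accuracy_score(y, y_pred_naive))  # 0.9 -- looks impressive
print(recall_score(y, y_pred_naive))    # 0.0 -- not a single fraudulent transaction is caught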

However, there are myriad ways of evaluating classification model performance other than accuracy, each with its own use cases, strengths, and weaknesses. Each evaluation metric makes some assumptions about the problem, or about what is important in the context of the problem. Therefore, an evaluation metric must be chosen that best captures the intent of the problem and of what is being classified, which makes choosing model evaluation metrics a challenging undertaking.

Most machine learning engineers and data scientists who use Python rely on the Scikit-learn library, which contains built-in functions for model performance evaluation. In this article, we will walk through 7 of the most widely used metrics, implement them, and explore their use cases along with their advantages and disadvantages, as listed below.

  1. Accuracy Score
  2. Recall/Sensitivity
  3. Precision
  4. F-Score
  5. Classification Report
  6. Receiver Operating Characteristic (ROC) Curve
  7. Area Under ROC Curve

The Building Blocks

Before we delve into the details of each of the metrics, it is important to cover the four building blocks used to define the evaluation metrics:

  • True Positive (TP) – Actual label is positive and prediction is also positive
  • True Negative (TN) – Actual label is negative and prediction is also negative
  • False Positive (FP) – Actual label is negative but prediction is positive
  • False Negative (FN) – Actual label is positive but prediction is negative
Actual Label vs Predicted Label

Remember, the definition of “Positive” and “Negative” depends on the objective of the problem. If the classifier model’s objective is to detect patients that have diabetes, then “Positive” refers to samples, i.e. patients, who have diabetes.
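In Scikit-learn, these four counts can be read straight off the confusion matrix. Below is a minimal sketch using made-up labels (with 1 as the positive class), not the diabetes data:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels ordered [0, 1], ravel() returns TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 3 3 1 1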

Dataset Extraction and Model Implementation

For our walkthroughs, we will be using the diabetes dataset from Kaggle, which is a binary classification dataset. The dataset is first extracted below using Pandas.

import pandas as pd
diabetes_df=pd.read_csv('diabetes.csv')
diabetes_df.head()
The first few entries of the diabetes dataset. An ‘Outcome’ of ‘0’ represents a negative test result for diabetes and ‘1’ represents a positive result

Next, let’s explore the balance of the target variable ‘Outcome’, to see how balanced the dataset is.

diabetes_df.Outcome.value_counts()
Outcome    Frequency
0          500
1          268

As we can see above, the target variable is heavily skewed towards the Outcome value of ‘0’.

Next, we implement the classification model on the dataset using a basic k-Nearest Neighbour (kNN) classifier and an 80-20 train-test split. As you can see below, the functions used for splitting the dataset as well as for model implementation come from the Scikit-Learn library. To stay within the scope of this article, we will not delve too deeply into the selection of the classification model, but feel free to explore other classification models such as SVC, Random Forest, Logistic Regression, GBM, etc.

x = diabetes_df.drop('Outcome',axis=1).values
y = diabetes_df['Outcome'].values
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1, stratify=y)
from sklearn.neighbors import KNeighborsClassifier
#Create kNN (k Nearest Neighbor) classifier, with k value of 15
knn = KNeighborsClassifier(n_neighbors = 15)
#Fit the classifier to the data
knn.fit(x_train,y_train)
y_pred = knn.predict(x_test)

We print out the first 15 samples with their actual target variable and the predicted target variable by the k-NN classifier just to gauge the classifier’s ability.

pd.DataFrame(data={'Predicted': y_pred, 'Actual': y_test}).head(15)
    Predicted  Actual
0           0       0
1           0       0
2           0       0
3           1       1
4           0       0
5           0       0
6           0       0
7           0       0
8           1       0
9           0       0
10          0       0
11          0       0
12          0       1
13          0       0
14          0       0

We apply a similar treatment to the Iris flower dataset, in terms of dataset extraction and classifier implementation. The Iris flower dataset is a multiclass dataset, used to predict the flower species based on petal and sepal dimensions.
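The Iris walkthrough is not reproduced in full here, but a minimal sketch of that treatment, loading the copy of the dataset bundled with Scikit-Learn, might look like the following:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the Iris dataset (3 flower classes, 4 features) and apply the same 80-20 split
iris = load_iris()
x_train_i, x_test_i, y_train_i, y_test_i = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=1, stratify=iris.target)

knn_iris = KNeighborsClassifier(n_neighbors=15).fit(x_train_i, y_train_i)
y_pred_iris = knn_iris.predict(x_test_i)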

Next, we dive straight into the evaluation metrics.

1. Accuracy Score

Accuracy is the most basic version of evaluation metrics. It is calculated as the ratio of correct predictions (TP + TN) over all the predictions made (TP + TN + FP + FN).

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy score can be obtained from Scikit-learn’s accuracy_score function, which takes the actual labels and predicted labels as inputs.

from sklearn.metrics import accuracy_score
print('accuracy =', accuracy_score(y_test, y_pred))

Accuracy = 0.74026

Accuracy is also one of the most misused of all evaluation metrics. The only proper use case for the accuracy score is a dataset that is almost perfectly balanced, which is rarely the case for real-world datasets. The reason is that a high accuracy score is attainable by any no-skill/naive classifier that only ever predicts the majority class. Additionally, the accuracy metric does not allow data scientists to prioritize the importance of True Positives or True Negatives which, as we will see later, depends on the objective of the classifier model.

2. Recall/Sensitivity

Recall, often referred to as Sensitivity or True Positive Rate (TPR), is the fraction of samples that actually belong to the positive class which the classifier correctly predicted as positive. It summarizes how well the positive class was predicted by the classifier.

Recall = TP / (TP + FN)

from sklearn.metrics import recall_score
recall_score(y_test, y_pred)

Recall = 0.44444

As you can see, the recall score is significantly lower than the accuracy score.

Recall is the go-to metric when there is a high cost associated with a False Negative. A potential use case for this is sick patient detection, a perfect fit for our diabetes dataset. If a sick patient (actual Positive) goes through the classifier model and is predicted as not sick (predicted Negative), that is definitely less desirable than the reverse case. The cost associated with a False Negative is extremely high if the sickness goes undetected by the classifier and also happens to be contagious. Therefore, for a model such as our diabetes classifier, when you do hyperparameter tuning you would want to tune it to maximize the recall evaluation metric, since, as noted above, the model currently performs quite poorly on recall and thus on meeting its objective.
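As a sketch of what tuning for recall might look like in practice (the grid of k values below is illustrative, not taken from the original notebook):

from sklearn.model_selection import GridSearchCV

# Search over the number of neighbours, keeping the model with the best cross-validated recall
param_grid = {'n_neighbors': list(range(3, 31, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='recall', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)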

3. Precision

Precision, often referred to as Positive Predictive Value (PPV), is the fraction of samples predicted to be in the positive class that actually belong to the positive class. It summarizes how precise the model is: of those predicted as positive, how many actually are positive.

Precision = TP / (TP + FP)

from sklearn.metrics import precision_score
precision_score(y_test, y_pred)

Precision = 0.70588

Compared to the recall score, this classifier model performs much better on the precision evaluation metric, coming very close to the accuracy score.

Precision is a good metric to use when the cost of a False Positive is high. A potential use case for precision as the evaluation metric is a spam vs. ham email classifier. In email spam detection, a false positive means that an email that is actually non-spam (actual negative) has been classified as spam (predicted positive). If precision is not optimized for such a spam detection model, the user might lose important emails to the junk/spam folder. Given that our k-NN classifier has a higher precision score than recall, it is, as it stands, better suited to a precision-critical problem like spam detection than to our recall-critical diabetes problem.

4. F-Score

F-Score, often referred to as F-Measure or F1 score, is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall).

from sklearn.metrics import f1_score
f1_score(y_test, y_pred)

F-Score = 0.545454

As the harmonic mean of precision and recall, the F-Score nestles between the two in terms of its value.
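As a quick sanity check, the same number can be recomputed by hand from the precision and recall values obtained above:

p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
# Harmonic mean of precision and recall
print(2 * p * r / (p + r))  # 0.545454..., matching f1_score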

In many applications, some balance between precision and recall is desired; those are the use cases for the F-Score. An example is a classifier where a False Negative carries no greater downstream (negative) impact than a False Positive, such as a basic image classifier.

5. Classification Report

Scikit-Learn also provides a very convenient summary of precision, recall, and F-score through its classification report.

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
Classification report for the kNN classifier, showing precision, recall, F1-score, and support for each class

6. ROC Curve

A receiver operating characteristic (ROC) curve is a diagnostic plot that visualizes the behaviour of a binary classifier by calculating the false positive rate and true positive rate as the model’s classification/discrimination threshold is varied. It is essentially a plot of signal (True Positive Rate) versus noise (False Positive Rate).

Going back to basics, the threshold value defines the prediction probability above which a given test sample is labelled as predicted positive rather than predicted negative during the classification step. For most models, the default threshold value is 0.5.
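For instance, here is a minimal sketch of lowering the threshold from the default 0.5 to 0.3 using the kNN model's predicted probabilities (the 0.3 value is arbitrary, chosen purely for illustration):

# Probability of the positive class (column 1 of predict_proba)
proba_positive = knn.predict_proba(x_test)[:, 1]

# The default behaviour is equivalent to a 0.5 threshold
y_pred_default = (proba_positive >= 0.5).astype(int)

# A lower threshold flags more samples as positive, trading precision for recall
y_pred_low = (proba_positive >= 0.3).astype(int)
print(recall_score(y_test, y_pred_default), recall_score(y_test, y_pred_low))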

import scikitplot as skplt
y_probas=knn.predict_proba(x_test)
skplt.metrics.plot_roc(y_test, y_probas, figsize=(10, 8))
ROC curves for the kNN classifier on the diabetes test set

Note that for a multiclass classification problem, the individual ROC curves for each class will be a One vs Rest plot.

For reference, below is an illustration comparing a good and bad ROC curve.

Illustration of a good vs. a bad ROC curve

A classifier that does a very good job of distinguishing between the classes will have a ROC curve that hugs the top left corner. The diagonal line represents a no-skill model that does no better than random guessing. It is often a good idea to plot the ROC curves of different classification models and see which performs best.

However, as you can see, the ROC curve is a visualization technique and does not output a quantitative score to use as an evaluation metric. That is where the next metric comes in.

7. ROC Area Under Curve (AUC)

As the name suggests, ROC AUC calculates the area under the ROC curve and provides a single score as an evaluation metric. As seen in the visualization, the larger the area under the curve, the more skilled the classifier, and vice versa; an ROC AUC of 1 represents a perfectly skilled classifier.

from sklearn.metrics import roc_auc_score
probs = y_probas[:, 1]  # probability of the positive class
print ('ROC AUC =', roc_auc_score(y_test, probs))

ROC-AUC = 0.7865

Final Thoughts

The above are just a few of the more common evaluation metrics used for classification models. There are a few others out there as well, such as the Precision-Recall Curve; feel free to explore them and research their use cases. As we have seen, the evaluation metric (or combination thereof) that should be used for a given classification model depends entirely on the model’s objectives and the business problem at hand. You must narrow down your evaluation metric before you move on to the next stage of tuning that model’s hyperparameters.
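As a pointer, a Precision-Recall Curve for the same kNN model could be sketched along these lines (using the predicted probabilities computed earlier):

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt

precision, recall, thresholds = precision_recall_curve(y_test, y_probas[:, 1])
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.show()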

In case you want to access the ipynb code file, you can find it here. Additionally, more examples of model evaluation can be found in Chapter 5 of Introduction to Machine Learning with Python.

Classification Scoring Functionalities with Scikit-Learn

Supervised learning is the branch of machine learning that deals with regression and classification. Classification functionalities are used frequently, but simply running the classification functions does not fulfil the objective of machine learning; there must be a performance evaluation justifying the accuracy of the classification.

Outline of this Article

This article will use the MNIST dataset available in the Sci-Kit Learn dataset library and demonstrate three different types of classification – Binary Classification, Multiclass Classification, and Multi-Label Classification (Preparing the Data & Classifications section). Later on, we will measure the performance of these classifications using Sci-Kit Learn model evaluation scoring techniques (Classification Scoring Functions section).

Preparing the Data & Classifications

In this section, we will prepare the dataset & classification models, which will help us alongside this article.

Preparing the Data

The MNIST dataset contains 70,000 tiny images of handwritten digits.

>>> from sklearn.datasets import fetch_openml

>>> mNist_DataSet = fetch_openml('mnist_784', version=1)

>>> print(mNist_DataSet.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

We will only focus on data (the image) and target (label for image) information.

>>> data = mNist_DataSet["data"]

>>> label = mNist_DataSet["target"]

Let us look at an example image in the dataset.

>>> import matplotlib as mpl

>>> import matplotlib.pyplot as plot

>>> imageArray = data[1]

>>> image = imageArray.reshape(28, 28)

>>> plot.imshow(image, cmap="binary")

>>> plot.axis("off")

>>> plot.show()

Figure 1: An example image from the dataset

>>> label[1]

'0'

The image (Figure 1) is the number zero (0) and the corresponding label is ‘0’ as a string.

import numpy as np
label = label.astype(np.uint8)  # type-cast the labels from string to integer

Now, we will create separate variables for test & training “data” (data = mNist_DataSet[“data”]) and test & training “label” (label = mNist_DataSet[“target”]).

data_train, data_test, label_train, label_test = data[:60000], data[60000:], label[:60000], label[60000:]

Note: We will use Figure 1 throughout this article (these are the places where the imageArray variable is used). Keep in mind that this figure represents the number 0 and its label is ‘0’ as a string. In most places, our classifiers will predict an output using this image.

Preparing the Classification Models

Now that we are ready with the training data and test data, let’s move on to classifying the data.

Binary Classification

In Scikit-Learn, we will use the SGDClassifier model to build our binary classifier.

>>> label_train_0 = (label_train == 0)

>>> label_test_0 = (label_test == 0)

The purpose of this training is for the classifier to learn whether a number is 0 or not.

>>> from sklearn.linear_model import SGDClassifier

>>> sgdClassification = SGDClassifier(random_state=42)

>>> sgdClassification.fit(data_train, label_train_0)
>>> print(sgdClassification.predict([imageArray]))
[ True]

Testing the model: the prediction is ‘True’ – we input Figure 1 and the prediction is True (it is a 0). We can conclude that the training was successful and the classifier works as intended.

Multiclass Classification

In Sci-Kit Learn, the Support Vector Machine (SVM) classifier (SVC) handles multiclass classification. The purpose of this training is for the classifier to predict the correct label for an image.

>>> from sklearn.svm import SVC

>>> svmClassification = SVC(gamma='scale')

>>> svmClassification.fit(data_train, label_train)

>>> print(svmClassification.predict([imageArray]))

[0]

Testing the model: the prediction is ‘[0]’ – we input Figure 1 and the model predicted the label 0. We can conclude that the training was successful and the classifier works as intended.

Multi-Label Classification

In Scikit-Learn, the KNeighborsClassifier model supports multi-label classification.

>>> from sklearn.neighbors import KNeighborsClassifier

>>> label_train_large = (label_train >= 7)
>>> label_train_odd = (label_train % 2 == 1)
>>> multilabelArray = np.c_[label_train_large, label_train_odd]

>>> knnClassification = KNeighborsClassifier()
>>> knnClassification.fit(data_train, multilabelArray)

The purpose of this training is to determine whether a given image is greater than or equal to 7, and whether it is an odd number.

>>> print(knnClassification.predict([imageArray]))

[[False False]]

Testing the model: the prediction is ‘[[False False]]’ – we input Figure 1, which is neither 7 or greater nor odd. We can conclude that the training was successful and the classifier works as intended.

Classification Scoring Functions

Once you apply classification models to your machine learning tasks, performance evaluation helps you determine their accuracy. The scoring functionalities available in Sci-Kit Learn are an ideal solution for this.

There are three ways of conducting a performance evaluation of classification model predictions – the estimator score method, the scoring parameter, and metric functions. In this article, we will focus on the Metrics module available in Sci-Kit Learn.

The Sci-Kit Learn Metrics module has several functions for evaluating classification models. We will focus on a few of the available scoring strategies.
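For context, here is a rough sketch of how the three approaches differ (reusing variables defined in the previous section; these calls are illustrative and not part of the original walkthrough – the rest of this article uses the metric functions):

>>> # 1. Estimator score method: every classifier exposes .score() (mean accuracy by default)
>>> print(sgdClassification.score(data_test, label_test_0))

>>> # 2. Scoring parameter: cross-validation tools accept a scoring string
>>> from sklearn.model_selection import cross_val_score
>>> print(cross_val_score(sgdClassification, data_train, label_train_0, cv=3, scoring="accuracy"))

>>> # 3. Metric functions: compare the true labels against predictions directly
>>> from sklearn.metrics import accuracy_score
>>> print(accuracy_score(label_test_0, sgdClassification.predict(data_test)))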

Binary Classification Scoring

In this section, we will demonstrate three main scoring functionalities – the Confusion Matrix, Precision and Recall, and the ROC Curve – and use them to evaluate our binary classifier.

Confusion Matrix

One way to evaluate the performance of a classification model is the confusion matrix. The objective of this metric is to count the number of times the model got confused – for example, how often class A (the number 0) was confused with another class B (say, the number 9).

Note: throughout this article, we use models trained to identify the digit 0 and to distinguish it from the other digits.

In order to build the confusion matrix, we need a set of predictions to compare against the true labels. Here we will use the cross_val_predict() function to generate predictions on the training data.

>>> from sklearn.model_selection import cross_val_predict
>>> label_train_pred = cross_val_predict(sgdClassification, data_train, label_train_0, cv=3)

The cross_val_predict() function performs K-fold cross-validation and returns the predictions made on each test fold. Each prediction is therefore made by a model that never saw that data during training (known as a clean prediction).

Now let’s use the confusion matrix passing the target class and predicted class. Then, let’s have a look at how the functions score the dataset.

>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(label_train_0, label_train_pred))
[[53717   360]
 [  507  5416]]

The score 53717 is the number of true negatives (correctly classified as not-0).

The score 360 is the number of false positives (wrongly classified as 0).

The score 507 is the number of false negatives (wrongly classified as not-0).

The score 5416 is the number of true positives (correctly classified as 0).

A perfect classifier would have only true-positives & true-negatives.
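For reference, here is a sketch of what the matrix would look like for perfect predictions; we simply compare the true labels against themselves, so only the diagonal (the row totals of the matrix above) is populated:

>>> print(confusion_matrix(label_train_0, label_train_0))
[[54077     0]
 [    0  5923]]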

Precision and Recall

Sci-Kit Learn provides two further model evaluation scoring functions: precision_score and recall_score.

>>> from sklearn.metrics import precision_score, recall_score

>>> print(precision_score(label_train_0, label_train_pred))

0.9376731301939059

>>> print(recall_score(label_train_0,  label_train_pred))

0.9144014857335809 

According to the precision and recall scores, when the model predicts that an image is a 0, it is correct 93.7% of the time, and it correctly detects 91.4% of all the images that actually are a 0.

The harmonic mean of precision and recall is known as the F1 score. Sci-Kit Learn provides this via the f1_score() function. Let’s see what score we receive using this function.

>>> from sklearn.metrics import f1_score

>>> print(f1_score(label_train_0, label_train_pred))

0.925891101803573

According to the f1_score() function, the F1 score is 92.6%.

The ROC Curve

The Receiver Operating Characteristic (ROC) curve is another way to score a binary classifier model. It plots the True-Positive Rate against the False-Positive Rate as a graphical representation. Sci-Kit Learn provides the roc_curve() function to achieve this.

For this purpose, we will use the cross_val_predict() function to get the decision scores of all instances in the training set. We will use the outcome to score our model.

>>> from sklearn.metrics import roc_curve

>>> label_scores = cross_val_predict(sgdClassification, data_train, label_train_0, cv=9, method="decision_function")

>>> falsePositiveRate, truePositiveRate, thresholds = roc_curve(label_train_0, label_scores)

Using the above information, let us plot the graph.

def plot_roc_curve(fpr, tpr, label=None):
    plot.plot(fpr, tpr, linewidth=2, label=label)
    plot.plot([0, 1], [0, 1], 'k--')  # dashed diagonal = no-skill line
    plot.xlabel('False-Positive Rate')
    plot.ylabel('True-Positive Rate')

plot_roc_curve(falsePositiveRate, truePositiveRate)
plot.show()

As the blue line hugs the y-axis and the top-left corner, the score should be close to 1. We can confirm this by measuring the Area Under the Curve, which summarizes the quality of the model’s predictions; a perfect classification model has a value of 1. Using Sci-Kit Learn, you can use the roc_auc_score() function to find this score.

>>> from sklearn.metrics import roc_auc_score 

>>> print(roc_auc_score(label_train_0, label_scores))

0.995201351056529

As expected, the score is about 0.995, very close to a perfect 1.

Scoring Multi-Class Classification

We can score our multiclass classification using the decision_function() method available on the Support Vector Machine (SVM) classifier. We have already confirmed that the SVM model’s prediction was correct, so now let’s look at the scores behind that prediction. As usual, we input the image in Figure 1.

>>> svmScores = svmClassification.decision_function([imageArray])
>>> print(svmScores)
[[ 9.31776763  0.69966542 8.26937495  3.82063539 -0.30671293 7.27141643  3.80978873 1.72165536 6.0316466 3.83885601]]

The decision_function() assigns a score to each class. According to the output, the score of 9.3 is the highest. The function took Figure 1 as input and scored each possible label; the class that corresponds to 0 received the highest score. We can also note that the digits 2, 5, and 8 received relatively high scores – based on the margins between these scores, you can judge how confidently the SVM classifier separates the classes.
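As a small sketch, the raw scores can be turned into a predicted label by taking the index of the highest score; the classifier's classes_ attribute tells us which label each column corresponds to:

>>> import numpy as np
>>> print(svmClassification.classes_[np.argmax(svmScores)])
0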

Scoring Multi-Label Classification

F1 scoring is one of the best ways to evaluate the performance of multi-label classification. Using the F1 score, we can score each individual label and then take the average across all the labels.

>>> label_train_knnPrediction = cross_val_predict(knnClassification, data_train, multilabelArray, cv=9)

>>> print(f1_score(multilabelArray, label_train_knnPrediction, average="macro"))

0.977410268890205

The F1 score returns 97.7% for the multi-label classification.
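As an aside (this call is our own addition, not part of the original walkthrough), passing average=None to the same function returns the score of each label separately before averaging:

>>> print(f1_score(multilabelArray, label_train_knnPrediction, average=None))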

Conclusion

This article has covered only a limited selection of classification models and scoring functions; several other methods are available, and you can find them on the official Sci-Kit Learn website. We hope this article gave you a basic understanding of classification and of how to evaluate it. We recommend applying the same approach to the other available functions and practices.

Tips for Performing EDA With Python


What is Exploratory Data Analysis (EDA)?

EDA with Python is a critical skill for all data analysts, scientists, and even data engineers. EDA, or Exploratory Data Analysis, is the act of analyzing a dataset to understand the main statistical characteristics with visual and statistical methods.

“The greatest value of a picture is when it forces us to notice what we never expected to see.”

– John W. Tukey

John Tukey defined the main process for statisticians, and now data analysts, to explore data to enable the creation of hypotheses about a given dataset. The steps that should be taken have varied since Tukey came up with this process in 1961. However, many of the basics have not changed including the following:

  • Loading and Understanding Data Definitions in Your Data
  • Checking the Contents of the Data For Issues
  • Assessing the Data Types
  • Extracting Summary Statistics
  • Generation of Data Visualizations:
    • Boxplots
    • Scatter Plots
  • Creating Advanced Statistics
    • Correlation Matrices

Where we stop in this post is before the advanced diagnostics used to check regression analysis, classifier analysis, and clustering analysis – operations normally reserved for specific business problem solutions. We’ll cover those topics in more detailed posts about those specific areas of analysis.

As for when and where EDA occurs in the traditional analytics and data science life-cycle, we can see below that EDA is one of the first and most critical steps before we proceed to any type of productionization of an algorithm. Within CRISP-DM (the Cross Industry Standard Process for Data Mining), we perform EDA in the Data Understanding phase of our initial analysis review.

CRISP-DM (Cross Industry Standard Process for Data Mining) diagram

In this post, we’ll go over at a very high level some of the Business Understanding steps, but mainly we will focus on the second step of the CRISP-DM framework.

Python Libraries For EDA

For this walkthrough we will use four libraries: NumPy for numerical operations, Pandas for loading and manipulating tabular data, and Matplotlib and Seaborn for visualization. The import code for each library is below:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline 
sns.set(color_codes=True)

For more background on the Pandas ecosystem, see our What is Pandas article.

Load the Sample Data

We’ll be using an open-source dataset from FSU on Home sale statistics. It contains data on fifty home sales, with selling price, asking price, living space, rooms, bedrooms, bathrooms, age, acreage, taxes. There is also an initial header line, which we will modify in our data loading steps below:

file_name = "https://people.sc.fsu.edu/~jburkardt/data/csv/homes.csv"
df = pd.read_csv(file_name)
df.columns = ['Sell', 'List', 'Living', 'Rooms', 'Beds', 'Baths',
       'Age', 'Acres', 'Taxes']
df.head()
  • Data column definitions: Sell (selling price), List (asking price), Living (living space), Rooms, Beds, Baths, Age, Acres, and Taxes

Check Column & Row Contents

df.dtypes                   # check the data type of each column
df.count()                  # count non-null observations per column
df = df.drop_duplicates()   # remove any duplicate rows
df.count()                  # re-count to confirm whether duplicates were dropped
df.isnull().sum()           # count missing values per column
df.isnull().values.any()    # quick check: are there any missing values at all?

Column Data Type Assessment

One additional step we should take as part of our evaluation is to check whether the datatypes loaded from the original dataset match our descriptive understanding of the underlying data.

One example of this is that all data in our dataset is read in as integer and float values. We should change at least two of the int64 variables into categorical datatypes – Beds and Rooms – because those data points take a limited, fixed set of possible values.

df.dtypes
df['Beds'] = df['Beds'].astype('category')
df['Rooms'] = df['Rooms'].astype('category')
df.dtypes

Summary Statistics

Summary statistics give you an overall view of the metrics in your dataset at a glance. They include the count of observations, the mean, the standard deviation, the minimum, the 25th, 50th, and 75th percentiles, and the maximum value of each Series. What isn’t usually included in the output of summary statistics are categorical or string variables.

To get summary statistics in Pandas, you simply need to use DataFrame.describe() to get an output similar to the below:

df.describe()
DataFrame.describe() output in Pandas
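Since describe() skips the categorical columns we created earlier, those can be summarized separately by passing an include argument; a small sketch:

# Summarize only the categorical columns (count, number of unique values, top value, and its frequency)
df.describe(include='category')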

Boxplots

One important step beyond simply pulling summary statistics is to get a visual sense of the distributions of variables within a given Series, or of a Series split by category, using boxplots.

In the below example we generate a boxplot in the Pandas library. Here we look at List price from our dataset and split the data by the Beds variable.

df[['List','Beds']].boxplot(by='Beds')
boxplot in Pandas

While Pandas does a good job displaying the data, we can also use the Seaborn library’s boxplot function to produce a slightly more visually appealing plot of the same data.

sns.boxplot(x='Beds', y='Sell', data=df,palette='rainbow')
boxplot() in seaborn

Histogram & Jointplot Generation

Histograms, known as distribution plots in Seaborn, are another critical means of looking at our continuous variables in Python. Distribution plots are used to visualize univariate distributions of observations. Anyone who has taken a statistics 101 course would be familiar with them as a concept. They can be used to identify outliers, identify how normal a dataset is, and whether there are potential gaps in your dataset, along with other applications.

Below, we see a simple and elegant command for generating distribution plots using Seaborn on a Series within a DataFrame – in this case, the List price from our dataset.

sns.distplot(df['List'])
distplot

While histograms are useful to view, they are ostensibly univariate in nature, meaning they only show the distribution of one variable at a time. When we want to compare the distributions of two variables at a time in Python, we can use the joint plot function.

Below, we show the distributions of our List and Sell data points on the top and right-hand side of the visualization. Additionally, we see the observations in a hex plot, an optional plot style within Seaborn, laid out on a cartesian plane to visualize how the two Series are related.

sns.jointplot(x='List',y='Sell',data=df,kind='hex')
Jointplot

The quick and dirty approach to plotting histograms with just the Pandas library is seen below, using the Series.hist() function.

df['Sell'].hist()

Scatter Plots & Pair Plots

Another common method of performing bivariate analysis, or comparing more than one variable, is to use scatter plots and pair plots. Scatter plots are useful for showing individual values from two Series of a Pandas DataFrame plotted on a two-dimensional cartesian X-Y plane.

Here, we show a simple scatter plot visualized in Seaborn using Taxes as our value for the X-axis and Sell as our value for the Y-axis.

sns.scatterplot(x="Taxes", y="Sell",data=df)

One additional and useful technique is to show a bivariate analysis in a scatter plot with an overlay of observations colored by a categorical or other value. Below, using Pandas’ plotting interface, we show the relationship between Taxes and Acres in our dataset, colored by the number of Baths in each observation (or house).

df.plot.scatter(x='Taxes',y='Acres',c='Baths',colormap='viridis')

Pair plots can play a similar role to individual scatter plots, as they provide a variety of visualizations. Pair plots provide a bivariate analysis between each pair of variables in a DataFrame and, similar to scatter plots, can have observations colored by categorical variables. Additionally, pair plots show the distribution of each individual variable along the diagonal of the display.

In the below, we show a pair plot using just the Sell, Taxes, Acres, and Beds variables from our DataFrame (to use all variables makes a much larger visualization which is hard to read).

sns.set(style="ticks")
sns.pairplot(df[["Sell","Taxes","Acres","Beds"]], hue="Beds")

Correlation Matrix

# Pairwise correlations between the numeric columns
df.corr()
# Plot the correlation matrix as a Seaborn heatmap
f, ax = plt.subplots(figsize=(10, 8))
corr = df.corr()
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
A correlation matrix produced by DataFrame.corr() and styled with a Seaborn heatmap.

Data Analysis Summary

While we worked through the examples of EDA in this dataset, we can come away from our view of this data with a few findings.

  • Taxes and the Sell price appear highly positively correlated – this is shown in the pair plot and correlation matrix outputs
  • Acres does not correlate strongly with either Taxes or the Sell price – we can see this in the correlation matrix outputs
  • Age appears negatively correlated with Taxes and with the Sell and List prices – we can see this in the correlation matrix outputs
  • List and Sell prices are strongly positively correlated – the directionality of a correlation is important to understand, and we see this in our scatter plots and correlation matrix

While there are many other items we could discuss in this dataset, the above are just a few of the items we can walk away from this EDA process having learned.
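As a quick numeric check on these findings, the correlations with the selling price can be ranked directly; a small sketch:

# Rank each variable's correlation with the selling price, strongest positive first
df.corr()['Sell'].sort_values(ascending=False)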

Summary

While performing EDA with Python can seem challenging at first, it is a rather straightforward process, as we have shown here.

There is a massive amount of information on EDA that you can find outside of this post; however, we’ve summarized nearly all the statistics you may want to examine before beginning any advanced analytics on top of your dataset.

Below are some of the best articles, academic libraries, and GitHub repositories on the web showing how EDA with Python can be performed with other example datasets:

We hope you enjoyed this article. For the code used in this article see our GitHub Repo here.