Classification Scoring Functionalities with Scikit-Learn

Supervised learning is the branch of machine learning that deals with regression and classification, and classification is among its most widely used tasks. Simply running a classifier, however, does not fulfill the objective of machine learning. A performance evaluation must show how accurate the classifier's predictions are.

Outline of this Article

This article uses the MNIST dataset, fetched through scikit-learn's dataset utilities, to demonstrate three different types of classification: binary classification, multiclass classification, and multi-label classification (Preparing the Data & Classifications section). We then measure the performance of these classifiers using scikit-learn's model evaluation scoring techniques (Classification Scoring Functions section).

Preparing the Data & Classifications

In this section, we will prepare the dataset and the classification models that we will use throughout this article.

Preparing the Data

The MNIST dataset contains 70,000 small images of handwritten digits, each 28×28 pixels.

>>> from sklearn.datasets import fetch_openml

>>> mNist_DataSet = fetch_openml('mnist_784', version=1, as_frame=False)  # as_frame=False returns NumPy arrays, so data[1] selects a row

>>> print(mNist_DataSet.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'details', 'categories', 'url'])

We will only focus on data (the image) and target (label for image) information.

>>> data = mNist_DataSet["data"]

>>> label = mNist_DataSet["target"]

Let us look at an example image in the dataset.

>>> import matplotlib as mpl

>>> import matplotlib.pyplot as plot

>>> imageArray = data[1]

>>> image = imageArray.reshape(28, 28)

>>> plot.imshow(image, cmap="binary")

>>> plot.axis("off")


Figure 1: An example image from the dataset

>>> label[1]

'0'

The image (Figure 1) is the number zero (0), and the corresponding label is the string '0'.

>>> import numpy as np

>>> label = label.astype(np.uint8)  # cast the labels from strings to integers

Now, we will split both the data and the labels into training and test sets: the first 60,000 instances are for training and the remaining 10,000 are for testing.

data_train, data_test, label_train, label_test = data[:60000], data[60000:], label[:60000], label[60000:]

Note: We will use the image in Figure 1 throughout this article (wherever the imageArray variable appears). Keep in mind that this image shows the number 0 and that its original label is the string '0'. Most of our classifiers will make their predictions on this image.

Preparing the Classification Models

Now that the training and test data are ready, let's move on to classifying the data.

Binary Classification

In scikit-learn, the SGDClassifier model can be used for binary classification.

>>> label_train_0 = (label_train == 0)

>>> label_test_0 = (label_test == 0)

The purpose of this training is for the classifier to learn whether a number is 0 or not.

>>> from sklearn.linear_model import SGDClassifier

>>> sgdClassification = SGDClassifier(random_state=42)

>>> sgdClassification.fit(data_train, label_train_0)
>>> print(sgdClassification.predict([imageArray]))
[ True]

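Testing the model: the prediction is True. We gave the classifier the image in Figure 1 and it predicted that the digit is 0. This single correct prediction is an encouraging sign, but measuring how well the classifier really performs is the job of the scoring functions covered later.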

Multiclass Classification

In scikit-learn, the Support Vector Machine (SVM) classifier, SVC, can be used for multiclass classification. The purpose of this training is for the classifier to learn the correct label for an image.

>>> from sklearn.svm import SVC

>>> svmClassification = SVC(gamma='scale')

>>> svmClassification.fit(data_train, label_train)

>>> print(svmClassification.predict([imageArray]))

[0]

Testing the model: the prediction is [0]. We gave the classifier the image in Figure 1 and it predicted the label 0, which is correct for this image.

Multi-Label Classification

In scikit-learn, the KNeighborsClassifier model supports multi-label classification.

>>> from sklearn.neighbors import KNeighborsClassifier

>>> label_train_large = (label_train >= 7)
>>> label_train_odd = (label_train % 2 == 1)
>>> multilabelArray = np.c_[label_train_large, label_train_odd]

>>> knnClassification = KNeighborsClassifier()
>>> knnClassification.fit(data_train, multilabelArray)

The purpose of this training is to predict two labels for a given image: whether the digit is large (7 or greater) and whether it is odd.

>>> print(knnClassification.predict([imageArray]))

[[False False]]

Testing the model: the prediction is [[False False]]. We gave the classifier the image in Figure 1, a 0, which is neither 7 or greater nor odd, so both predicted labels are correct.
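To make the two target columns concrete, the following sketch builds the same kind of multi-label array for the ten digit values 0 to 9 rather than for the full training labels. The `digits` array is a stand-in introduced only for this illustration.

```python
import numpy as np

# Stand-in labels (assumption): one example of each digit instead of label_train.
digits = np.arange(10, dtype=np.uint8)

is_large = digits >= 7      # True for 7, 8, 9
is_odd = digits % 2 == 1    # True for 1, 3, 5, 7, 9

# Stack the two boolean targets side by side, as np.c_ does in the article.
targets = np.c_[is_large, is_odd]

for digit, row in zip(digits, targets):
    print(digit, row)
```

Each row holds the pair [large, odd] for one digit, so a classifier trained on such targets returns two answers per image.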

Classification Scoring Functions

Once you have applied classification models to your machine learning task, performance evaluation tells you how well they work. The scoring functionality available in scikit-learn is well suited to this.

scikit-learn offers three ways of evaluating the quality of a model's predictions: the estimator score method, the scoring parameter, and metric functions. In this article, we will discuss only the metric functions available in the sklearn.metrics module.

The sklearn.metrics module provides several functions for evaluating classification models. We will focus on a few of the scoring strategies available.
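As a rough illustration of the three evaluation routes mentioned above, the sketch below applies each of them to the same 0 versus not-0 task. It assumes the small digits dataset bundled with scikit-learn (load_digits) as a quick stand-in for MNIST, and the variable names are chosen only for this example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: the small bundled digits dataset stands in for MNIST.
X, y = load_digits(return_X_y=True)
y_is_zero = (y == 0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y_is_zero, random_state=42)

clf = SGDClassifier(random_state=42).fit(X_train, y_train)

# 1. Estimator score method: mean accuracy for classifiers.
score_method = clf.score(X_test, y_test)

# 2. Scoring parameter: a metric selected by name during cross-validation.
cv_scores = cross_val_score(
    SGDClassifier(random_state=42), X_train, y_train, cv=3, scoring="accuracy")

# 3. Metric function: called directly on true and predicted labels.
metric_function = accuracy_score(y_test, clf.predict(X_test))

print(score_method, cv_scores, metric_function)
```

The first and third routes report the same accuracy here, because a classifier's score() method computes mean accuracy on the data it is given.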

Binary Classification Scoring

In this section, we will demonstrate three main scoring functionalities – Confusion Matrix, Precision and Recall, and the ROC curve, and evaluate our binary classification.

Confusion Matrix

One way to evaluate the performance of a classification model is the confusion matrix. It counts how often instances of one class are classified as another, for example how many images of class A (the number 0) were classified as class B (say, the number 9).

Note: the binary classifier in this article separates 0 from every other digit (not-0), so its confusion matrix has only these two classes.

To compute the confusion matrix, we need a set of predictions to compare with the actual labels. Here we will use the cross_val_predict() function to generate predictions for the training data.

>>> from sklearn.model_selection import cross_val_predict
>>> label_train_pred = cross_val_predict(sgdClassification, data_train, label_train_0, cv=3)

The cross_val_predict() function performs K-fold cross-validation and returns the predictions made on each test fold. Every prediction therefore comes from a model that never saw that instance during training (known as a clean prediction).
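The following sketch shows the shape of what cross_val_predict() returns: one out-of-fold prediction per instance. It assumes the small digits dataset bundled with scikit-learn (load_digits) as a quick stand-in for MNIST, and the variable names are chosen only for this example.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict

# Assumption: the small bundled digits dataset stands in for MNIST.
X, y = load_digits(return_X_y=True)
y_is_zero = (y == 0)

clf = SGDClassifier(random_state=42)

# With cv=3, each prediction comes from a model fitted on the other two folds.
out_of_fold_pred = cross_val_predict(clf, X, y_is_zero, cv=3)

print(out_of_fold_pred.shape)   # one prediction per instance
print(out_of_fold_pred[:10])
```

Because the result lines up one to one with the labels, it can be passed directly to any metric function together with the true labels.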

Now let's pass the target classes and the predicted classes to the confusion_matrix() function and look at the result.

>>> from sklearn.metrics import confusion_matrix
>>> print(confusion_matrix(label_train_0, label_train_pred))
[[53717   360]
 [  507  5416]]

53717 is the number of true negatives: not-0 images correctly classified as not-0.

360 is the number of false positives: not-0 images wrongly classified as 0.

507 is the number of false negatives: 0 images wrongly classified as not-0.

5416 is the number of true positives: 0 images correctly classified as 0.

A perfect classifier would have only true-positives & true-negatives.
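As a quick sanity check on how the matrix is laid out, the sketch below unpacks the four counts printed above and adds up the rows. It uses plain Python only, and the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
matrix = [[53717, 360],
          [507, 5416]]

# Rows are actual classes (not-0, then 0); columns are predicted classes.
(tn, fp), (fn, tp) = matrix

actual_not_zero = tn + fp   # images that are not 0
actual_zero = fn + tp       # images that are 0
total = actual_not_zero + actual_zero

print(actual_not_zero, actual_zero, total)  # 54077 5923 60000
```

The row totals add up to the 60,000 training images, which confirms that every training instance is counted exactly once.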

Precision and Recall

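scikit-learn also provides the precision_score() and recall_score() functions. Precision is the fraction of positive predictions that are correct, TP / (TP + FP); recall is the fraction of actual positives that are detected, TP / (TP + FN).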

>>> from sklearn.metrics import precision_score, recall_score

>>> print(precision_score(label_train_0, label_train_pred))


>>> print(recall_score(label_train_0,  label_train_pred))


According to the precision and recall scores, when the classifier claims an image is a 0, it is correct about 93.7% of the time (precision), and it detects about 91.4% of the images that really are 0 (recall).
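Both scores can be reproduced by hand from the confusion matrix counts shown earlier. The sketch below does this in plain Python; it illustrates the definitions and is not scikit-learn's implementation.

```python
# Counts from the confusion matrix of the 0 / not-0 classifier.
tn, fp, fn, tp = 53717, 360, 507, 5416

# Precision: of all images predicted as 0, how many really are 0?
precision = tp / (tp + fp)

# Recall: of all images that really are 0, how many were detected?
recall = tp / (tp + fn)

print(f"precision = {precision:.4f}")
print(f"recall    = {recall:.4f}")
```

Neither formula involves the true negatives, which is why these scores stay informative even though not-0 images vastly outnumber 0 images.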

The F1 score combines precision and recall into a single number: their harmonic mean. scikit-learn provides it as the f1_score() function. Let's see what score we get.

>>> from sklearn.metrics import f1_score

>>> print(f1_score(label_train_0, label_train_pred))


According to the f1_score() function, the F1 score is 92.5%. Note that this is not the same thing as accuracy.
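The harmonic mean can be worked out from the same confusion matrix counts. The helper below, f1_from_counts, is a hypothetical name used only for this sketch and is not part of scikit-learn.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall (hypothetical helper)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Counts from the confusion matrix of the 0 / not-0 classifier.
f1 = f1_from_counts(tp=5416, fp=360, fn=507)
print(f"F1 = {f1:.4f}")
```

Because it is a harmonic mean, the F1 score is pulled toward the lower of the two values, so a classifier only scores well when both precision and recall are high.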

The ROC Curve

The Receiver Operating Characteristic (ROC) curve is another tool for evaluating a binary classifier. It plots the true-positive rate against the false-positive rate at every possible decision threshold. scikit-learn provides the roc_curve() function to compute these rates.

For this purpose, we will use the cross_val_predict() function to get the decision scores of all instances in the training set, rather than class predictions. We will use these scores to evaluate our model.

>>> from sklearn.metrics import roc_curve 

>>> label_scores = cross_val_predict(sgdClassification, data_train, label_train_0, cv=9, method="decision_function")

>>> falsePositiveRate, truePositiveRate, thresholds = roc_curve(label_train_0, label_scores)

Using the above information, let us plot the graph.

def plot_roc_curve(fpr, tpr, label=None):
    # Plot the ROC curve from the rates passed in as arguments.
    plot.plot(fpr, tpr, linewidth=2, label=label)
    plot.plot([0, 1], [0, 1], 'k--')  # Dashed diagonal
    plot.xlabel('False-Positive Rate')
    plot.ylabel('True-Positive Rate')

plot_roc_curve(falsePositiveRate, truePositiveRate)
plot.show()

The closer the curve hugs the top-left corner, rising steeply along the y-axis, the better the classifier; the dashed diagonal corresponds to a purely random classifier. The area under the curve (AUC) summarizes this in one number: a perfect classifier has an AUC of 1, and a purely random one has an AUC of 0.5. scikit-learn provides the roc_auc_score() function to compute it.

>>> from sklearn.metrics import roc_auc_score 

>>> print(roc_auc_score(label_train_0, label_scores))


As expected, the score is 99.5%, which is close to the ideal 100%.
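One way to build intuition for the AUC is that it equals the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The sketch below computes it that way on a tiny made-up example; auc_by_pair_counting is a hypothetical helper, not scikit-learn's implementation.

```python
def auc_by_pair_counting(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy data (assumption): two negatives and two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_by_pair_counting(labels, scores)
print(auc)  # 0.75
```

Three of the four pairs are ranked correctly, giving 0.75. A score of 99.5% therefore means the classifier ranks a 0 above a not-0 in almost every such pair.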

Scoring Multi-Class Classification

For multiclass classification, we can inspect the scores returned by the decision_function() method of the Support Vector Machine (SVM) classifier. We have already confirmed that the SVM model predicted Figure 1 correctly, so now we will look at the scores behind that prediction. As before, we will input the image in Figure 1.

>>> svmScores = svmClassification.decision_function([imageArray])
>>> print(svmScores)
[[ 9.31776763  0.69966542 8.26937495  3.82063539 -0.30671293 7.27141643  3.80978873 1.72165536 6.0316466 3.83885601]]

The decision_function() method returns one score per class for the input image (Figure 1). Here the highest score, about 9.3, belongs to class 0, which matches the prediction. The next-highest scores belong to 2, 5, and 8, which suggests these are the digits the classifier finds most similar to this image. The gap between the top score and the rest gives a sense of how confident the classifier is in this single prediction.
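The ranking described above can be checked directly on the printed scores. The sketch below assumes that column i of the output corresponds to digit i, which holds when the classifier's classes are the digits 0 to 9 in order.

```python
# Scores copied from the decision_function() output above.
scores = [9.31776763, 0.69966542, 8.26937495, 3.82063539, -0.30671293,
          7.27141643, 3.80978873, 1.72165536, 6.0316466, 3.83885601]

# Assumption: the position in the list is the digit the score belongs to.
ranked = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)

predicted = ranked[0]
print(predicted)     # 0
print(ranked[:4])    # [0, 2, 5, 8]
```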

Scoring Multi-Label Classification

The F1 score is a common way to evaluate multi-label classification. We compute the F1 score for each individual label and then average the scores across all labels (average="macro" gives every label equal weight).

>>> label_train_knnPrediction = cross_val_predict(knnClassification, data_train, multilabelArray, cv=9)

>>> print(f1_score(multilabelArray, label_train_knnPrediction, average="macro"))


The macro-averaged F1 score for the multi-label classifier is 97.7%.
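To show what macro averaging does, the sketch below computes an F1 score per label column on a tiny made-up example and then takes the unweighted mean. The data and the f1_for_label helper are hypothetical and serve only as an illustration.

```python
def f1_for_label(true_col, pred_col):
    """F1 for one label column; assumes at least one true or predicted positive."""
    tp = sum(1 for t, p in zip(true_col, pred_col) if t and p)
    fp = sum(1 for t, p in zip(true_col, pred_col) if not t and p)
    fn = sum(1 for t, p in zip(true_col, pred_col) if t and not p)
    return 2 * tp / (2 * tp + fp + fn)

# Toy data (assumption): four instances, two labels each.
y_true = [[1, 0], [1, 1], [0, 1], [0, 0]]
y_pred = [[1, 0], [0, 1], [0, 1], [0, 1]]

per_label = [
    f1_for_label([row[j] for row in y_true], [row[j] for row in y_pred])
    for j in range(2)
]
macro_f1 = sum(per_label) / len(per_label)

print(per_label)
print(macro_f1)
```

Macro averaging treats every label as equally important. If some labels matter more because they have more positive instances, average="weighted" is the alternative scikit-learn offers.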


This article has covered only a limited number of classification models and scoring functions. Several other methods are available, and you can find them on the official scikit-learn website. We hope this article gave you a basic understanding of classification and of how to evaluate classifiers. We recommend applying the same approach to the other available functions.