Dimensionality Reduction Using scikit-learn in Python

Datasets with a large number of features are difficult to analyze, and processing them can demand a great deal of computational power. Dimensionality reduction offers a powerful way of dealing with high-dimensional data. These techniques reduce the number of features without losing much information, which allows for more robust analysis. A model trained on the reduced data can often match, or even exceed, the performance of one trained on the full feature set.

In this article, we present a guide to three dimensionality reduction techniques that are available in the scikit-learn library for Python.

Dimensionality Reduction

High-dimensional data is challenging for statistical models. Luckily, much of that data is often redundant and can be reduced to a smaller number of variables without losing much information.

We normally use dimensionality reduction in machine learning and in data exploration. In machine learning, we use it to reduce the number of features. This lowers the computational cost and may lead to better model performance.

In data exploration, we can use dimensionality reduction to project the data into two dimensions. Such a visualization can help us detect outliers or clusters in the data.


Principal Component Analysis (PCA)

PCA is one of the most widely used unsupervised learning algorithms, and it is inherently a dimensionality reduction technique. If your data has more than three dimensions, you can use PCA to project it onto two or three dimensions and visualize it.

PCA projects the data onto k orthogonal basis vectors u that minimize the projection error. For instance, suppose we have a 2D dataset with the features height and weight. Using PCA, we can project this 2D dataset onto a single vector u to obtain a 1D representation.

Principal Component Analysis Illustration

When we apply PCA to a dataset, it identifies the principal components of the data: the directions that account for the most variance. Moreover, the principal components are always orthogonal to each other.
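
To make the height and weight example concrete, here is a minimal sketch that computes the principal directions with NumPy's singular value decomposition and projects a 2D dataset onto the first one. The data is synthetic and the numbers used to generate it are made up for illustration only.

```python
import numpy as np

# Synthetic, made-up height (cm) and weight (kg) data for illustration only.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=200)
weight = 0.9 * (height - 170) + 70 + rng.normal(0, 4, size=200)
data = np.column_stack([height, weight])

# Center the data, then take the SVD; the rows of vt are the principal axes.
centered = data - data.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

u = vt[0]                      # first principal direction (unit vector)
projected_1d = centered @ u    # 2D -> 1D projection onto u

# Share of the total variance captured by each principal axis.
explained_ratio = singular_values**2 / np.sum(singular_values**2)
print("first direction u:", u)
print("explained variance ratio:", explained_ratio)
```

The two rows of `vt` are orthogonal unit vectors, which is the orthogonality property mentioned above. Later in the article the same job is done by scikit-learn's PCA class.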


When should you use PCA?

It’s important to note that PCA works well with highly correlated variables. If the relationship between the variables is weak, PCA won’t be effective. You can look at the correlation matrix to decide whether to use PCA. If most of the coefficients are smaller than 0.3 in absolute value, PCA is unlikely to be a good choice.

Additionally, you can look at the correlation coefficients to determine which variables are highly correlated. If you find such variables, you can keep only one of them in the analysis. A common cutoff for "highly correlated" is an absolute correlation of 0.8.
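
The two rules of thumb above can be checked in a few lines. The following is a sketch on the Iris data (the dataset used later in this article). The 0.3 and 0.8 thresholds are the ones quoted in the text, and the variable names are our own.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
names = iris.feature_names
corr = np.corrcoef(iris.data, rowvar=False)   # 4 x 4 Pearson correlation matrix

# Look only at the distinct feature pairs (upper triangle, no diagonal).
rows, cols = np.triu_indices_from(corr, k=1)
pair_corr = corr[rows, cols]

# Share of pairs that are only weakly correlated.
weak_share = np.mean(np.abs(pair_corr) < 0.3)

# Pairs that pass the "highly correlated" cutoff.
strong_pairs = [(names[i], names[j], round(float(corr[i, j]), 2))
                for i, j in zip(rows, cols) if abs(corr[i, j]) >= 0.8]

print(f"share of pairs with |r| < 0.3: {weak_share:.2f}")
for pair in strong_pairs:
    print(pair)
```

If `weak_share` is close to 1, PCA is unlikely to compress the data much. The pairs listed in `strong_pairs` are candidates for keeping only one variable.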


Linear Discriminant Analysis (LDA)

LDA is a supervised machine learning algorithm that is commonly used for dimensionality reduction. The general approach is similar to PCA, but instead of looking for the directions of maximum variance, LDA finds the components that maximize the separation between classes relative to the variance within each class. We often use LDA as a preprocessing step for classification models.
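
To illustrate what "separating the classes" means numerically, here is a rough sketch of the classical Fisher formulation of LDA written with NumPy. It builds the within-class and between-class scatter matrices for the Iris data and looks at the eigenvalues of their ratio. This is a textbook-style illustration under our own variable names, not the solver scikit-learn uses internally.

```python
import numpy as np
from sklearn import datasets

X_demo, y_demo = datasets.load_iris(return_X_y=True)
overall_mean = X_demo.mean(axis=0)
n_features = X_demo.shape[1]

S_w = np.zeros((n_features, n_features))  # within-class scatter
S_b = np.zeros((n_features, n_features))  # between-class scatter
for label in np.unique(y_demo):
    X_c = X_demo[y_demo == label]
    mean_c = X_c.mean(axis=0)
    S_w += (X_c - mean_c).T @ (X_c - mean_c)
    diff = (mean_c - overall_mean).reshape(-1, 1)
    S_b += len(X_c) * (diff @ diff.T)

# Directions that maximize between-class relative to within-class scatter
# are the leading eigenvectors of inv(S_w) @ S_b.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
order = np.argsort(eigvals.real)[::-1]
eigvals = eigvals.real[order]
print("eigenvalues:", np.round(eigvals, 4))
```

With three classes, only two eigenvalues are meaningfully different from zero, which is why LDA can produce at most two components for the Iris dataset.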

When should you use LDA?

We can use LDA only for supervised learning. This means that we need to know the class labels in advance.

Some experiments have compared classification results after applying PCA or LDA. These experiments show that classification accuracy tends to improve when using PCA. Ultimately, though, the performance of these techniques largely depends on the characteristics of the dataset.
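
Since the outcome depends on the dataset, it is worth running the comparison on your own data. Below is a sketch of one way to do it: each reducer is placed in a pipeline with a classifier and scored with cross-validation. The choice of a k-nearest-neighbors classifier, five folds, and two components is our own assumption for illustration.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = datasets.load_iris(return_X_y=True)

reducers = {
    "PCA": PCA(n_components=2),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
}
scores = {}
for name, reducer in reducers.items():
    # Scaling and reduction are fitted inside each fold to avoid leakage.
    model = make_pipeline(StandardScaler(), reducer,
                          KNeighborsClassifier(n_neighbors=5))
    scores[name] = cross_val_score(model, X_demo, y_demo, cv=5).mean()
    print(f"{name}: mean accuracy = {scores[name]:.3f}")
```

Keeping the reducer inside the pipeline matters for LDA in particular, because it uses the class labels and would otherwise see the labels of the validation fold.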


t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE is a valuable data visualization technique. It is unsupervised and non-linear. The t-SNE cost function is non-convex, so different initializations can lead to different local minima. If the number of features is very high, it is advisable to first reduce the number of dimensions with another technique, such as PCA.
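
Both points can be seen in a short experiment. The sketch below uses a 300-sample subset of scikit-learn's digits dataset (64 features), reduces it to 30 dimensions with PCA, and then runs t-SNE twice with different random seeds. The subset size, the 30 components, and the seeds are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = datasets.load_digits()
X_digits = digits.data[:300]            # small subset to keep the run short

# Step 1: reduce the 64 pixel features to 30 principal components.
X_pca30 = PCA(n_components=30, random_state=0).fit_transform(X_digits)

# Step 2: run t-SNE twice with different random initializations.
emb_a = TSNE(n_components=2, init="random", random_state=0).fit_transform(X_pca30)
emb_b = TSNE(n_components=2, init="random", random_state=1).fit_transform(X_pca30)

print(emb_a.shape, emb_b.shape)
print("identical embeddings:", np.allclose(emb_a, emb_b))
```

The two embeddings have the same shape but different coordinates. Fixing `random_state` makes a single run reproducible.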

When should you use t-SNE?

t-SNE places similar samples close to each other in the embedding, but the resulting axes have no direct interpretation in terms of the original features, so we cannot clearly see how the samples relate with respect to those features. It is used for data exploration, especially for visualizing high-dimensional data.

t-SNE does not learn a function from the original space to the new one. Because of this, it cannot map new data according to the results of a previous t-SNE run. In other words, it cannot serve as a preprocessing step in classification models that must handle unseen data.
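
This difference is visible directly in the scikit-learn API. In the sketch below, PCA is fitted on one part of the Iris data and then applied to held-out samples, while the t-SNE estimator offers no `transform` method for doing the same. The 140/10 split is arbitrary.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X_demo, _ = datasets.load_iris(return_X_y=True)
X_old, X_new = X_demo[:140], X_demo[140:]

pca = PCA(n_components=2).fit(X_old)
X_new_pca = pca.transform(X_new)      # PCA learned a mapping, so this works

print("PCA has transform:", hasattr(pca, "transform"))
print("t-SNE has transform:", hasattr(TSNE(n_components=2), "transform"))
```

To embed new samples with t-SNE, you have to rerun `fit_transform` on the combined data, which produces a new embedding for every point.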


Hands-on Example With the Iris Dataset

In this section, we will show you how to use dimensionality reduction in Python. First, let’s import the necessary libraries: pandas and NumPy for data manipulation, seaborn and matplotlib for data visualization, and scikit-learn (imported as sklearn) for the dimensionality reduction itself.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

Next, we need to load a dataset. We chose the Iris dataset.

# import the iris dataset
iris_dataset = datasets.load_iris()
X = iris_dataset.data 
y = iris_dataset.target
target_names = iris_dataset.target_names

Now, let’s take a look at the dataset that we will use. We chose the Iris dataset because it is well known in the machine learning literature. It contains 3 classes, where each class refers to a type of Iris plant.

iris_df = pd.DataFrame(iris_dataset.data, columns = iris_dataset.feature_names)
iris_df['Species']=iris_dataset['target']
iris_df['Species']=iris_df['Species'].apply(lambda x: iris_dataset['target_names'][x])
iris_df.head()
Information about Iris dataset

We can also see how classes are separated regarding different features.

colors = {'Setosa':'#FCEE0C','Versicolor':'#FC8E72','Virginica':'#FC3DC9'}

# Let's see how the classes are separated with respect to different features

sns.FacetGrid(iris_df, hue="Species", height=4, palette=list(colors.values())) \
   .map(plt.scatter, "sepal length (cm)", "sepal width (cm)") \
   .add_legend()

sns.FacetGrid(iris_df, hue="Species", height=4, palette=list(colors.values())) \
   .map(plt.scatter, "petal length (cm)", "petal width (cm)") \
   .add_legend()
plt.show()
Visualization of the Iris dataset considering only two features at a time

A correlation matrix can help us understand the dataset better. It tells us how our four features are correlated. The correlation matrix is easy to plot with the seaborn library. You can also check out our tutorial on the different plots that you can create with seaborn.

Correlation matrix of Iris dataset

From the correlation matrix, we can see a high correlation between the features petal length and petal width, whereas sepal length and sepal width are only weakly correlated.
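
The article does not show the code behind the correlation figure, so here is a minimal sketch of how such a plot could be produced with pandas and seaborn. The color map, figure size, and variable names are our own choices and may differ from the original figure.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets

iris = datasets.load_iris()
features = pd.DataFrame(iris.data, columns=iris.feature_names)

# Pearson correlation between every pair of the four features.
corr = features.corr()

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
            vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix of the Iris features")
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session.
```

The annotated cells make it easy to read off which pairs of features pass the thresholds discussed in the PCA section.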

PCA with 2 components

Now, let’s apply PCA with 2 components. This will help us represent our data in two dimensions.

First, we need to standardize the features so that each one has zero mean and unit variance.

# Use StandardScaler to standardize the features
scaler = StandardScaler()
X = scaler.fit_transform(X)
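
As a quick sanity check, the sketch below confirms what the scaler does: after the transformation, every feature has a mean of approximately 0 and a standard deviation of approximately 1. It uses its own variable names so that it does not interfere with the X used in the rest of the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X_raw = datasets.load_iris().data
X_scaled = StandardScaler().fit_transform(X_raw)

# Compare per-feature statistics before and after scaling.
print("raw means:   ", X_raw.mean(axis=0).round(2))
print("scaled means:", X_scaled.mean(axis=0).round(2))
print("scaled stds: ", X_scaled.std(axis=0).round(2))
```

Standardizing matters for PCA because the method is driven by variance: without it, features measured on larger scales would dominate the principal components.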

After standardization, we can transform our features using PCA.

pca2 = PCA(n_components=2)
X_r = pca2.fit_transform(X)

for color, i, target_name in zip(colors.values(), [0, 1, 2], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=.8, 
                label=target_name, s=130, edgecolors='k')
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.xlabel('1st PCA component')
plt.ylabel('2nd PCA component')
plt.title('PCA of IRIS dataset')

# Percentage of variance explained by each component
# (on the standardized features, the first two components capture about 0.958 of the total variance)
print('explained variance ratio (first two components): %s'
      % str(pca2.explained_variance_ratio_))

plt.show()

PCA with 2 components helped us easily plot our dataset in two dimensions.

PCA with two components helps us to visualize the Iris dataset

We can see that Iris Setosa is very different from the other two classes. We can also calculate the explained variance, which tells us how much of the total variance our two components account for.

We got a total of 95.8% for the first two components. This means that the first two principal components account for 95.8% of the variance. This is a good result, and it means that our 2D representation is meaningful. As a rule of thumb, if this score were below about 85%, our 2D representation of the data might not be reliable.
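
The same information can be used to choose the number of components instead of fixing it in advance. The sketch below prints the cumulative explained variance for the standardized Iris features and then asks PCA to keep just enough components to reach a target share. The 95% target is an arbitrary example.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)

# Fit all components and accumulate their explained variance ratios.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print("cumulative explained variance:", cumulative.round(3))

# A float n_components asks PCA to keep enough components to reach that share.
pca_95 = PCA(n_components=0.95).fit(X_scaled)
print("components needed for 95%:", pca_95.n_components_)
```

For this dataset, two components are enough to pass the 95% mark, which agrees with the 95.8% figure above.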

PCA with 3 components

To get a better understanding of the interaction of the features, we can plot the first three PCA components.

fig = plt.figure(1, figsize=(8, 6))
# Create the 3D axes through the figure so that they are attached to it
ax = fig.add_subplot(111, projection='3d', elev=-150, azim=110)
pca3 = PCA(n_components=3)

# Note: this fit uses the raw (unscaled) measurements
X_reduced = pca3.fit_transform(iris_dataset.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.spring, edgecolor='k', s=130)
ax.set_title("First three PCA components")
ax.set_xlabel("1st PCA component")
ax.xaxis.set_ticklabels([])
ax.set_ylabel("2nd PCA component")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("3rd PCA component")
ax.zaxis.set_ticklabels([])

# Percentage of variance explained by each component
# (the first three components together capture about 0.9948 of the total variance)
print('explained variance ratio (first three components): {}'
      .format(pca3.explained_variance_ratio_))

plt.show()
Iris dataset represented with the first three principal components


LDA with two components

Now let’s calculate the first two LDA components and visualize them. In both PCA and LDA, the Setosa data is well separated from the other two classes. Also, we can see that LDA performs better at keeping the overlap between Versicolor and Virginica to a minimum.

lda = LinearDiscriminantAnalysis(n_components=2)
lda.fit(X, y)
X_r2 = lda.transform(X)
plt.figure(figsize=(10,8))
for color, i, target_name in zip(colors.values(), [0, 1, 2], target_names):
    plt.scatter(X_r2[y == i, 0], X_r2[y == i, 1], alpha=.8, color=color,
                label=target_name,  s=130, edgecolors='k')
plt.legend(loc=3, shadow=False, scatterpoints=1)
plt.xlabel('LDA1')
plt.ylabel('LDA2')           

plt.title('Iris projection onto the first 2 linear discriminants')

print('Explained variance ratio (first two linear discriminants): {}'.format(lda.explained_variance_ratio_))
plt.show()
Iris dataset projected with first two linear discriminants

t-SNE

We will visualize our dataset using t-SNE. We set the dimension of the embedded space to two.

# The iteration count is left at the scikit-learn default (the original n_iter=1000
# matched it); the parameter was renamed to max_iter in newer releases
tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)

plt.figure(figsize=(10, 8))

for color, i, target_name in zip(colors.values(), [0, 1, 2], target_names):
    plt.scatter(X_tsne[y == i, 0], X_tsne[y == i, 1], alpha=.8, color=color,
                label=target_name, s=130, edgecolors='k')
plt.legend(loc='best', shadow=False, scatterpoints=1)

plt.title('t-SNE embedding of the Iris dataset')
plt.show()
t-SNE projection with 2 dimensions

The result compares well with the PCA and LDA projections. As you can see, the Iris species form very clear clusters.

Summary

In this post, we covered the fundamental dimensionality reduction techniques in Python using the scikit-learn library. They helped us to reduce the number of dimensions in our original dataset and to visualize our data. We uncovered some hidden relationships between our features.

In the table below we give an overview of the techniques that we explored.

Summary of the dimensionality reduction techniques


We encourage you to study this topic further. You can find all the code from this article in our GitHub repository.

Vesna Bozovic
