Introduction
In this article, we take a detailed look at Seaborn visualizations in Python.
- Seaborn is a statistical plotting library and is built on top of Matplotlib.
- Seaborn has really beautiful default styles.
- Seaborn is designed to work really well with the Pandas dataframe objects.
We’re going to learn how to use Seaborn to plot effectively with Pandas. Seaborn and style go hand in hand.
Installation
To install Seaborn, run either of the following commands in your command line or terminal:
conda install seaborn
or,
pip install seaborn
Getting Started
First, we start by importing the Seaborn library using the following command:
import seaborn as sns
By convention, we import seaborn as “sns”.
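Now let’s get some data to plot. Seaborn comes with a few example datasets that you can load directly. Note that load_dataset() fetches them from seaborn’s online data repository, so you need an internet connection the first time you load one. The datasets that we will be using in this article are: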
- tips : load the tips dataset and save it as a dataframe using the following command:
tips=sns.load_dataset('tips')
We can check the head of this dataframe by typing in:
tips.head()
The output shows the first five rows of the dataframe. There are seven columns, and each row describes a party that had a meal and left a tip afterward: the total bill of the meal (total_bill), how much was left as the tip (tip), the sex of the person leaving the tip (sex), whether or not they were a smoker (smoker), the day and time of the meal (day, time), and the size of the party (size).
- flights: This is another built-in dataframe of seaborn and we load it using the following command:
flights = sns.load_dataset('flights')
We can check the first five records of the above dataframe by typing in:
flights.head()
This dataset primarily just shows the number of passengers that flew in a given month of a given year.
Types of Plots
We are going to discuss different plot types with Seaborn.
1. Distribution Plots
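The distplot() function shows the distribution of a univariate set of observations; univariate is just another way of saying one variable. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot(), so newer versions print a warning when you call it. Let’s go ahead and explore this.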
sns.distplot(tips['total_bill'])
What we get here is a histogram with a curve drawn over it. The curve is a KDE, short for Kernel Density Estimate. We can remove the KDE and increase the number of bins to see more detail by using the following command:
sns.distplot(tips['total_bill'],kde=False,bins=30)
Now we just have a histogram, which shows how the total bill values are distributed. The X-axis is divided into bins, and the height of each bar on the Y-axis is the count of bills that fall into that bin.
This gives us a good idea of the data. Most of the total bills are somewhere between $10 and $20, and the counts fade away as the bill gets higher.
Let’s talk about jointplot(). It is the counterpart of distplot() for bivariate data, meaning data with two variables: it combines the distribution plots of two variables in a single figure. Consider the following line of code:
sns.jointplot(x='total_bill',y='tip',data=tips)
Here, we are comparing the distribution of the total bill with the distribution of the tip. The plot is essentially two distribution plots, one for the tip along the Y-axis and one for the total bill along the X-axis, with a scatter plot in between. The scatter plot shows a trend: as the total bill goes up, so does the tip. That makes sense, because tips are usually proportionate to the total bill.
jointplot() takes an additional parameter called kind, which controls what is drawn inside the jointplot(). The available kinds include:
- “scatter”
- “reg”
- “resid”
- “kde”
- “hex”
sns.jointplot(x='total_bill',y='tip',data=tips,kind='hex')
sns.jointplot(x='total_bill',y='tip',data=tips,kind='reg')
Thus, you can play around with the parameters and get the plot that you want.
Next in line is pairplot(). It plots pairwise relationships across an entire dataframe, at least for the numerical columns, and it also supports a hue argument for coloring by a categorical column.
sns.pairplot(tips)
Keep in mind that the larger your dataframe, the longer pairplot() takes, so it can be slow on a very large dataframe. In the above figure, we have a plot for each pair of numerical columns: size versus total bill, size versus tip, and so on. When a column is paired with itself, for instance size versus size, you see a histogram instead of a scatterplot.
We can also add a hue argument to this for categorical column values.
sns.pairplot(tips,hue='sex',palette='coolwarm')
2. Categorical Plots
Now we’re going to shift our focus to plotting categorical data. For categorical plots, we’re mainly concerned with seeing how a numerical column is distributed across the levels of a categorical column.
The most basic categorical plot is the bar plot. A bar plot aggregates the data in each category with some function, which by default is the mean.
sns.barplot(x='sex',y='total_bill',data=tips)
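You can change the estimator to your own function, as long as it converts a vector to a scalar. The example below uses NumPy’s standard deviation, so NumPy has to be imported first:
import numpy as np
sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std)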
Countplot: It is essentially the same as barplot, except that the estimator explicitly counts the number of occurrences, which is why we only pass the x value:
sns.countplot(x='sex',data=tips)
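To see what these two plots aggregate, it can help to compute the same numbers directly with Pandas. The sketch below uses a tiny hand-made dataframe (hypothetical values, not the real tips data): the group means correspond to the default bar heights of barplot(), and the group counts to the bar heights of countplot().

```python
import pandas as pd

# Tiny hypothetical sample with the same column names as tips
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "total_bill": [20.0, 10.0, 30.0, 14.0],
})

# barplot(x='sex', y='total_bill') draws one bar per group at the group mean
mean_bill = sample.groupby("sex")["total_bill"].mean()

# countplot(x='sex') draws one bar per group at the number of rows
counts = sample["sex"].value_counts()

print(mean_bill)
print(counts)
```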
Next is the Box Plot. A Box Plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the inter-quartile range.
sns.boxplot(x="day", y="total_bill", data=tips,palette='rainbow')
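The outlier rule mentioned above can be sketched with NumPy. This example assumes the common convention of 1.5 times the inter-quartile range, which is the default value of the whis parameter of boxplot(); the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical bill amounts with one unusually large value
bills = np.array([10.0, 12.0, 14.0, 15.0, 16.0, 18.0, 20.0, 50.0])

# The box spans the first to the third quartile
q1, q3 = np.percentile(bills, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, q3, iqr)  # 13.5 18.5 5.0
print(outliers)     # [50.]
```

The whiskers themselves stop at the most extreme data points that still lie inside the fences, not at the fences.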
A Violin Plot plays a similar role as a box and whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual data points, the violin plot features a kernel density estimation of the underlying distribution.
sns.violinplot(x="day", y="total_bill", data=tips,palette='rainbow')
3. Matrix Plots
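The main kind of Matrix Plot is the Heat Map. In order for a heatmap to work properly, your data should already be in matrix form; the sns.heatmap function basically just colors it in for you. For example:
# Matrix form for correlation data
# numeric_only=True (pandas 1.5+) skips non-numeric columns such as sex and day,
# which otherwise raise an error in pandas 2.0 and later
tips.corr(numeric_only=True)
sns.heatmap(tips.corr(numeric_only=True))
For including annotations, you can try:
sns.heatmap(tips.corr(numeric_only=True),cmap='coolwarm',annot=True)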
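What “matrix form” means for correlation data can be checked on a small example. The sketch below uses a hypothetical four-row sample rather than the real tips data, and it selects the numerical columns explicitly with select_dtypes(), which is one way to get the same result regardless of the Pandas version.

```python
import pandas as pd

# Hypothetical sample: two numerical columns and one categorical column
sample = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.0, 3.0, 5.0],
    "sex": ["Male", "Female", "Male", "Female"],
})

# Keep only the numerical columns, then compute pairwise correlations
corr = sample.select_dtypes("number").corr()

# The result is a square matrix indexed by column name on both axes
print(corr)
```

A dataframe like corr, with the same labels on rows and columns, is the kind of input sns.heatmap() expects.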
Or for the flights dataframe, we can create a pivot table:
flights.pivot_table(values='passengers',index='month',columns='year')
pvflights = flights.pivot_table(values='passengers',index='month',columns='year')
sns.heatmap(pvflights)
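The pivot table is what turns the long-form records (one row per month and year) into a matrix. Here is a minimal sketch on a hypothetical four-row sample with the same column names as flights:

```python
import pandas as pd

# Hypothetical long-form sample: one row per (year, month) pair
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "passengers": [112, 118, 115, 126],
})

# Months become the rows, years become the columns, passengers fill the cells
matrix = long_form.pivot_table(values="passengers", index="month", columns="year")

print(matrix)
```

Each cell of matrix is then colored by sns.heatmap() according to its value.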
Conclusion
In this article, we worked with two small example datasets, tips and flights, and discussed various plots like:
- Distribution Plots,
- Categorical Plots, and
- Matrix Plots
For further reading, you can refer to:
- https://seaborn.pydata.org/ : The official seaborn documentation
- https://seaborn.pydata.org/examples/index.html : The gallery of seaborn where you can find a wide variety of example plots.
Thanks for reading and hope you like it!