At this point in the course, it is time to begin connecting the dots and applying visualization to your knowledge of statistics. Work through these programming examples to round out your knowledge of seaborn as it is applied to univariate and bivariate plots.
Bivariate Plots
pandas
Scatter plot
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
diamonds = pd.read_csv('data/diamonds.csv.gz')
diamonds.plot(x = 'carat', y = 'price', kind = 'scatter');
plt.show()

Box plot
diamonds.boxplot(column = 'price', by = 'color');
plt.show()

seaborn
Scatter plot
sns.regplot(data = diamonds, x = 'carat', y = 'price', fit_reg=False);
plt.show()

# We set linewidth to 0; otherwise the white edges around the markers
# wash out the figure. Try any positive value of linewidth to compare.
sns.scatterplot(data=diamonds, x = 'carat', y = 'price', linewidth=0);
plt.show()

Box plot
ordered_color = ['E','F','G','H','I','J']
sns.catplot(data = diamonds, x = 'color', y = 'price',
order = ordered_color, color = 'blue', kind = 'box');
plt.show()

Violin plot
g = sns.catplot(data = diamonds, x = 'color', y = 'price',
kind = 'violin', order = ordered_color);
plt.show()

Barplot (categorical vs continuous)
ordered_colors = ['D','E','F','G','H','I']
sns.barplot(data = diamonds, x = 'color', y = 'price', order = ordered_colors);
plt.show()

sns.barplot(data = diamonds, x = 'cut', y = 'price');
plt.show()
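It may help to see what the bars represent. By default, sns.barplot summarizes the y variable within each category by its mean (the estimator argument) and adds an error bar. The sketch below computes the same summary by hand; it uses a small made-up table (the name toy and its values are hypothetical stand-ins for the diamonds columns), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the diamonds columns 'cut' and 'price'
toy = pd.DataFrame({
    'cut': ['Fair', 'Fair', 'Good', 'Good', 'Ideal', 'Ideal'],
    'price': [300, 500, 400, 800, 1000, 2000],
})

# Mean price per category: the bar heights a default barplot would draw
bar_heights = toy.groupby('cut')['price'].mean()
print(bar_heights)
```

If a different summary is more appropriate for a skewed variable like price, barplot accepts another function through estimator, for example the median.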

Joint plot
sns.jointplot(data = diamonds, x = 'carat', y = 'price');
plt.show()

sns.jointplot(data = diamonds, x = 'carat', y = 'price', kind = 'reg');
plt.show()

sns.jointplot(data = diamonds, x = 'carat', y = 'price', kind = 'hex');
plt.show()

## Facets and multivariate data
The basic idea in this section is to see how we can visualize more than two variables at a time. We will see two strategies:
- Put multiple graphs on the same frame, with each graph referring to a level of a 3rd variable
- Create a grid of separate graphs, with each graph referring to a level of a 3rd variable
These strategies also work any time we need to visualize the data corresponding to different levels of a variable, say by gender, race, or country.
In this example, we're going to start with 4 time series, labelled A, B, C, D.
ts = pd.read_csv('data/ts.csv')
ts['dt'] = pd.to_datetime(ts['dt']) # convert this column to a datetime object
ts.head()
          dt kind     value
0 2000-01-01    A  1.442521
1 2000-01-02    A  1.981290
2 2000-01-03    A  1.586494
3 2000-01-04    A  1.378969
4 2000-01-05    A -0.277937
For one strategy we will employ, it is actually a bit easier to change this to a wide data form, using pivot.
dfp = ts.pivot(index = 'dt', columns = 'kind', values = 'value')
dfp.head()
kind               A         B         C         D
dt
2000-01-01  1.442521  1.808741  0.437415  0.096980
2000-01-02  1.981290  2.277020  0.706127 -1.523108
2000-01-03  1.586494  3.474392  1.358063 -3.100735
2000-01-04  1.378969  2.906132  0.262223 -2.660599
2000-01-05 -0.277937  3.489553  0.796743 -3.417402
fig, ax = plt.subplots()
dfp.plot(ax=ax);
plt.show()

This draws the four time series, one for each of the columns labeled A, B, C, and D, as separate lines on a single set of axes. The x-axis is determined by dfp.index, which we set to the values of dt from the original data during the pivoting operation.
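To make the long-to-wide reshaping concrete, here is a minimal sketch on a tiny made-up table. The names long_df and wide_df and all the values are hypothetical; only the column names dt, kind, and value mirror the data above.

```python
import pandas as pd

# Hypothetical long-form data with the same column names as ts
long_df = pd.DataFrame({
    'dt': pd.to_datetime(['2000-01-01', '2000-01-02'] * 2),
    'kind': ['A', 'A', 'B', 'B'],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# One row per date, one column per level of 'kind'
wide_df = long_df.pivot(index='dt', columns='kind', values='value')

print(wide_df)
print(wide_df.index.name)     # the index now carries the dates
print(list(wide_df.columns))  # one column per series
```

Because the dates end up in the index, a call like wide_df.plot() can use them for the x-axis without being told which column holds them.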
Using seaborn
…
sns.lineplot(data = dfp);
plt.show()

However, we can achieve this same plot using the original data, and seaborn, in rather short order:
sns.lineplot(data = ts, x = 'dt', y = 'value', hue = 'kind');
plt.show()

In this plot, assigning a variable to hue tells seaborn to draw lines (in this case) of different hues based on values of that variable.
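One way to think about hue is as a loop over the groups of that variable, drawing one line per group. The following is a rough sketch of that idea in plain matplotlib, not a description of how seaborn is implemented; long_df and its values are hypothetical, and seaborn also chooses the palette and aggregates repeated x values for you.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical long-form data with the same column names as ts
long_df = pd.DataFrame({
    'dt': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03'] * 2),
    'kind': ['A'] * 3 + ['B'] * 3,
    'value': [1.4, 2.0, 1.6, 1.8, 2.3, 3.5],
})

fig, ax = plt.subplots()
# One line per level of 'kind'; matplotlib cycles through colors automatically
for kind, group in long_df.groupby('kind'):
    ax.plot(group['dt'], group['value'], label=kind)
ax.legend(title='kind')
```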
We can use a bit more granular and explicit code for this as well. This allows us a bit more control of the plot.
g = sns.FacetGrid(ts, hue = 'kind', height = 5, aspect = 1.5)
g.map(plt.plot, 'dt', 'value').add_legend()
g.ax.set(xlabel = 'Date',
ylabel = 'Value',
title = 'Time series');
plt.show()
## All of this code chunk needs to be run at one time, otherwise you get weird errors. This
## is true for many plotting commands which are composed of multiple commands.

The FacetGrid tells seaborn that we're going to layer graphs, with layers based on hue and the hues being determined by values of kind. Notice that we can add a few more details like the aspect ratio of the plot and so on. The documentation for FacetGrid, which we will also use for facets below, may be helpful in finding all the options you can control.
We can also show more than one kind of layer on a single graph.
fmri = sns.load_dataset('fmri')
plt.style.use('seaborn-v0_8-notebook') # this style was named 'seaborn-notebook' before matplotlib 3.6
sns.relplot(x = 'timepoint', y = 'signal', data = fmri);
plt.show()

sns.relplot(x = 'timepoint', y = 'signal', data = fmri, kind = 'line');
plt.show()

sns.relplot(x = 'timepoint', y = 'signal', data = fmri, kind = 'line', hue ='event');
plt.show()

sns.relplot(x = 'timepoint', y = 'signal', data = fmri, hue = 'region',
style = 'event', kind = 'line');
plt.show()

Here we use color to show the region, and line style (solid vs dashed) to show the event.
Scatter plots by group
g = sns.FacetGrid(diamonds, hue = 'color', height = 7.5)
g.map(plt.scatter, 'carat', 'price').add_legend();
plt.show()

Notice that this arranges the values of the color variable, and the hues assigned to them, in no meaningful order (typically the order in which they first appear in the data). If we have a preferred order, we can impose it using the option hue_order.
clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
sns.scatterplot(x="carat", y="price",
hue="clarity", size="depth",
hue_order=clarity_ranking,
sizes=(1, 8), linewidth=0,
data=diamonds);
plt.show()

Facets
Faceting, also called trellis graphics, is a visualization method in which we draw multiple plots in a grid, with each plot corresponding to a unique value of a particular variable or a combination of variables. This has also been called small multiples.
We'll proceed with an example using the iris dataset.
iris = pd.read_csv('data/iris.csv')
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
g = sns.FacetGrid(iris, col = 'species', hue = 'species', height = 5)
g.map(plt.scatter, 'sepal_width', 'sepal_length').add_legend();
plt.show()

Here we use FacetGrid to indicate that we're creating multiple subplots by specifying the option col (for column). So this code says we are going to create one plot per level of species, arranged as separate columns (or, in effect, along one row). You could also specify row, which would arrange the plots one to a row or, in effect, in one column.
The map function says: take the facets I've defined and stored in g and, in each one, draw a scatter plot with sepal_width on the x-axis and sepal_length on the y-axis.
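To see what the grid and the mapping step are doing conceptually, the sketch below builds the same kind of one-row layout by hand with plt.subplots and a loop. It is an illustration of the idea rather than seaborn's actual implementation, and the table toy, with its handful of made-up rows, is a hypothetical stand-in for iris.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with the same column names as iris
toy = pd.DataFrame({
    'sepal_width': [3.5, 3.0, 3.2, 2.3, 3.3, 2.7],
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'species': ['setosa', 'setosa', 'versicolor', 'versicolor',
                'virginica', 'virginica'],
})

levels = list(toy['species'].unique())

# One panel per level, laid out as columns that share both axes
fig, axes = plt.subplots(1, len(levels), sharex=True, sharey=True, figsize=(9, 3))
for ax, level in zip(axes, levels):
    subset = toy[toy['species'] == level]          # rows for this facet only
    ax.scatter(subset['sepal_width'], subset['sepal_length'])
    ax.set_title(f'species = {level}')
    ax.set_xlabel('sepal_width')
axes[0].set_ylabel('sepal_length')
```

FacetGrid takes care of the subsetting, shared axes, titles, and legend, which is why the seaborn version is so much shorter.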
We could also use relplot for a more compact solution.
sns.relplot(x = 'sepal_width', y = 'sepal_length', data = iris,
col = 'species', hue = 'species');
plt.show()

A somewhat more complicated example uses the fmri data, where we color the lines by subject and create a 2-d grid, with the region of the brain along columns and the event type along rows.
sns.relplot(x="timepoint", y="signal", hue="subject",
col="region", row="event", height=3,
kind="line", estimator=None, data=fmri);
plt.show()

In the following example, we want to show how each subject fares for each of the two events, just within the frontal region. We let seaborn figure out the layout, specifying only that each subject gets its own panel (col = "subject") and that the panels wrap onto a new row once we've reached 5 columns (col_wrap = 5). Note that we use the query method to filter the dataset.
sns.relplot(x="timepoint", y="signal", hue="event", style="event",
col="subject", col_wrap=5,
height=3, aspect=.75, linewidth=2.5,
kind="line", data=fmri.query("region == 'frontal'"));
plt.show()
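Two pieces of the call above can be checked without drawing anything: query is a string-based way of writing a boolean filter, and col_wrap determines how many rows the wrapped grid needs. The sketch below uses a small made-up table (toy and its values are hypothetical, as is the choice of col_wrap = 2) rather than the fmri data.

```python
import math
import pandas as pd

# Hypothetical toy data with a few fmri-like columns
toy = pd.DataFrame({
    'subject': ['s0', 's0', 's1', 's1', 's2', 's2'],
    'region': ['frontal', 'parietal'] * 3,
    'signal': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# query with a string expression ...
frontal = toy.query("region == 'frontal'")
# ... selects the same rows as an explicit boolean mask
same = toy[toy['region'] == 'frontal']
print(frontal.equals(same))

# With one panel per subject and col_wrap columns per row,
# the grid needs ceil(n_panels / col_wrap) rows
n_panels = frontal['subject'].nunique()
col_wrap = 2
n_rows = math.ceil(n_panels / col_wrap)
print(n_panels, n_rows)
```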

In the following example, we want to compare the distribution of price from the diamonds dataset by color, so it makes sense to create density plots of the price distribution and stack them one below the next so we can visually compare them.
ordered_colors = ['E','F','G','H','I','J']
g = sns.FacetGrid(data = diamonds, row = 'color', height = 1.7,
aspect = 4, row_order = ordered_colors)
g.map(sns.kdeplot, 'price');
plt.show()

Here we use FacetGrid to create the set of univariate plots. In older versions of seaborn there is no function that draws univariate plots over a grid the way relplot does for bivariate plots; seaborn 0.11 and later add displot for that purpose.
Pairs plots
The pairs plot is a quick way to compare every pair of variables in a dataset (or at least every pair of continuous variables) in a grid. You can specify what kind of univariate plot will be displayed on the diagonal locations on the grid and which bivariate plots will be displayed on the off-diagonal locations.
sns.pairplot(data=iris);
plt.show()

You can achieve more customization using PairGrid.
g = sns.PairGrid(iris, diag_sharey=False);
g.map_upper(sns.scatterplot);
g.map_lower(sns.kdeplot, colors="C0");
g.map_diag(sns.kdeplot, lw=2);
plt.show()