Data Visualization in Python

At this point in the course, it is time to begin connecting the dots and applying visualization to your knowledge of statistics. Work through these programming examples to round out your knowledge of seaborn as it is applied to univariate and bivariate plots.

Bivariate Plots

pandas

Scatter plot

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
diamonds = pd.read_csv('data/diamonds.csv.gz')
diamonds.plot(x = 'carat', y = 'price', kind = 'scatter');
plt.show()

Box plot

diamonds.boxplot(column = 'price', by = 'color');
plt.show()

seaborn

Scatter plot

sns.regplot(data = diamonds, x = 'carat', y = 'price', fit_reg=False);
plt.show()

# linewidth = 0 removes the white edge that seaborn otherwise draws around each
# marker; with this many points those edges wash out the figure. Try any
# positive value of linewidth to see the effect.
sns.scatterplot(data = diamonds, x = 'carat', y = 'price', linewidth = 0);
plt.show()

Box plot

# Levels of color that are not listed here are left out of the plot
ordered_color = ['E','F','G','H','I','J']
sns.catplot(data = diamonds, x = 'color', y = 'price', 
            order = ordered_color, color = 'blue', kind = 'box');
plt.show()

Violin plot

g = sns.catplot(data = diamonds, x = 'color', y = 'price', 
                kind = 'violin', order = ordered_color);
plt.show()

Barplot (categorical vs continuous)

ordered_colors = ['D','E','F','G','H','I']
sns.barplot(data = diamonds, x = 'color', y = 'price', order = ordered_colors);
plt.show()

sns.barplot(data = diamonds, x = 'cut', y = 'price');
plt.show()

Joint plot

sns.jointplot(data = diamonds, x = 'carat', y = 'price');
plt.show()

sns.jointplot(data = diamonds, x = 'carat', y = 'price', kind = 'reg');
plt.show()

sns.jointplot(data = diamonds, x = 'carat', y = 'price', kind = 'hex');
plt.show()

Facets and multivariate data

The basic idea in this section is to see how we can visualize more than two variables at a time. We will see two strategies:

Put multiple graphs on the same frame, with each graph referring to a level of a 3rd variable
Create a grid of separate graphs, with each graph referring to a level of a 3rd variable

These strategies also work any time we need to visualize the data corresponding to different levels of a variable, say by gender, race, or country.

In this example, we're going to start with 4 time series, labelled A, B, C, D.

ts = pd.read_csv('data/ts.csv')
ts['dt'] = pd.to_datetime(ts['dt']) # convert the dt column to datetime values
ts.head()

          dt kind     value
0 2000-01-01    A  1.442521
1 2000-01-02    A  1.981290
2 2000-01-03    A  1.586494
3 2000-01-04    A  1.378969
4 2000-01-05    A -0.277937

For one strategy we will employ, it is actually a bit easier to change this to a wide data form, using pivot.

dfp = ts.pivot(index = 'dt', columns = 'kind', values = 'value')
dfp.head()

kind               A         B         C         D
dt                                                
2000-01-01  1.442521  1.808741  0.437415  0.096980
2000-01-02  1.981290  2.277020  0.706127 -1.523108
2000-01-03  1.586494  3.474392  1.358063 -3.100735
2000-01-04  1.378969  2.906132  0.262223 -2.660599
2000-01-05 -0.277937  3.489553  0.796743 -3.417402
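The long-to-wide reshaping can be checked on a tiny example. The sketch below uses made-up data (the names long_df, wide_df and round_trip are hypothetical and not part of the course files) and reverses the pivot with melt to confirm that no values are lost.

```python
import pandas as pd

# Hypothetical long-format data: two series (A, B) observed on three dates
long_df = pd.DataFrame({
    "dt": pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-03"] * 2),
    "kind": ["A"] * 3 + ["B"] * 3,
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Long -> wide: one row per date, one column per level of `kind`
wide_df = long_df.pivot(index="dt", columns="kind", values="value")

# Wide -> long again with melt, to confirm the reshaping is reversible
round_trip = (
    wide_df.reset_index()
           .melt(id_vars="dt", var_name="kind", value_name="value")
           .sort_values(["kind", "dt"])
           .reset_index(drop=True)
)
print(wide_df)
```

Note that pivot raises an error if any (index, columns) pair occurs more than once, so it suits data with exactly one value per date and series.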

fig, ax = plt.subplots()
dfp.plot(ax=ax);
plt.show()

This draws four time series lines on a single set of axes, one for each of the columns labeled A, B, C, and D. The x-axis comes from dfp.index, which the pivot operation set to the values of dt in the original data.

Using seaborn …

sns.lineplot(data = dfp);
plt.show()

However, we can achieve this same plot using the original data, and seaborn, in rather short order.

sns.lineplot(data = ts, x = 'dt', y = 'value', hue = 'kind');
plt.show()

In this plot, assigning a variable to hue tells seaborn to draw lines (in this case) of different hues based on values of that variable.
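To make the role of hue concrete, here is a rough conceptual equivalent written with pandas and matplotlib only: group the long data by the variable and draw one line per group. This is a sketch on synthetic data (the frame demo is made up) and is not a description of how seaborn works internally.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical long-format data: two random walks standing in for `ts`
rng = np.random.default_rng(0)
dates = pd.date_range("2000-01-01", periods=50, freq="D")
demo = pd.DataFrame({
    "dt": np.tile(dates, 2),
    "kind": np.repeat(["A", "B"], len(dates)),
    "value": np.concatenate([rng.normal(size=50).cumsum(),
                             rng.normal(size=50).cumsum()]),
})

fig, ax = plt.subplots()
# One line per level of `kind`; matplotlib moves to the next color on each call
for level, group in demo.groupby("kind"):
    ax.plot(group["dt"], group["value"], label=level)
ax.legend(title="kind")
plt.show()
```

The seaborn call does the grouping, coloring and legend in one step, which is why the hue version is so much shorter.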

We can use a bit more granular and explicit code for this as well. This allows us a bit more control of the plot.

g = sns.FacetGrid(ts, hue = 'kind', height = 5, aspect = 1.5)
g.map(plt.plot, 'dt', 'value').add_legend()

<seaborn.axisgrid.FacetGrid object at 0x1319093d0>

g.ax.set(xlabel = 'Date',
        ylabel = 'Value',
        title = 'Time series');
plt.show()

## All of this code chunk needs to be run at one time, otherwise you get weird errors. This
## is true for many plotting commands which are composed of multiple commands.

The FacetGrid tells seaborn that we're going to layer graphs, with layers based on hue and the hues being determined by values of kind. Notice that we can add a few more details like the aspect ratio of the plot and so on. The documentation for FacetGrid, which we will also use for facets below, may be helpful in finding all the options you can control.

We can also show more than one kind of layer on a single graph:

fmri = sns.load_dataset('fmri')

# Matplotlib 3.6 renamed this style to 'seaborn-v0_8-notebook'; use that name on newer versions
plt.style.use('seaborn-notebook')
sns.relplot(x = 'timepoint', y = 'signal', data = fmri);
plt.show()

sns.relplot(x = 'timepoint', y = 'signal', data = fmri, kind = 'line');
plt.show()

sns.relplot(x = 'timepoint', y = 'signal', data = fmri, kind = 'line', hue ='event');
plt.show()

sns.relplot(x = 'timepoint', y = 'signal', data = fmri, hue = 'region', 
            style = 'event', kind = 'line');
plt.show()

Here we use color to show the region, and line style (solid vs dashed) to show the event.

Scatter plots by group

g = sns.FacetGrid(diamonds, hue = 'color', height = 7.5)
g.map(plt.scatter, 'carat', 'price').add_legend();
plt.show()

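Notice that the levels of the color variable are not arranged in any meaningful order; for a plain string column, seaborn typically uses the order in which the values first appear in the data. If we have a preferred order, we can impose it using the option hue_order, shown here for the clarity variable.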

clarity_ranking = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]
sns.scatterplot(x="carat", y="price",
                hue="clarity", size="depth",
                hue_order=clarity_ranking,
                sizes=(1, 8), linewidth=0,
                data=diamonds);
plt.show()
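The effect of hue_order can be verified by reading the legend back from the axes. The following is a minimal sketch on synthetic data; the frame demo and the column grade are invented for illustration.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical data with an ordinal variable stored as plain strings
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "x": rng.uniform(size=90),
    "y": rng.uniform(size=90),
    "grade": rng.permutation(np.repeat(["high", "low", "mid"], 30)),
})

grade_order = ["low", "mid", "high"]  # the order we want in the legend
fig, ax = plt.subplots()
sns.scatterplot(data=demo, x="x", y="y", hue="grade",
                hue_order=grade_order, linewidth=0, ax=ax)

# Read the legend entries back to confirm the order that was applied
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
plt.show()
```

An alternative design is to convert the column to an ordered pandas Categorical once, so that the order does not have to be passed to every plotting call.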

Facets

Faceting, also known as trellis graphics, is a visualization method where we draw multiple plots in a grid, with each plot corresponding to a unique value of a particular variable or a combination of variables. This has also been called small multiples.

We'll proceed with an example using the iris dataset.

iris = pd.read_csv('data/iris.csv')
iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

g = sns.FacetGrid(iris, col = 'species', hue = 'species', height = 5)
g.map(plt.scatter, 'sepal_width', 'sepal_length').add_legend();
plt.show()

Here we use FacetGrid to indicate that we're creating multiple subplots by specifying the option col (for column). So this code says we are going to create one plot per level of species, arranged as separate columns (or, in effect, along one row). You could also specify row which would arrange the plots one to a row or, in effect, in one column.

The map function says, take the facets I've defined and stored in g, and in each one, plot a scatter plot with sepal_width on the x-axis and sepal_length on the y-axis.

We could also use relplot for a more compact solution.

sns.relplot(x = 'sepal_width', y = 'sepal_length', data = iris, 
            col = 'species', hue = 'species');
plt.show()

A slightly more complicated example uses the fmri data: we color the lines by subject and create a 2-d grid, with the region of the brain along columns and the event type along rows.

sns.relplot(x="timepoint", y="signal", hue="subject",
            col="region", row="event", height=3,
            kind="line", estimator=None, data=fmri);

plt.show()

In the following example, we want to show how each subject fares for each of the two events, just within the frontal region. We let seaborn work out the layout: we only say that subjects go along columns (col) and that the grid should wrap to a new row after every 5 panels (col_wrap). Note that we use the DataFrame's query method to filter the dataset.

sns.relplot(x="timepoint", y="signal", hue="event", style="event",
            col="subject", col_wrap=5,
            height=3, aspect=.75, linewidth=2.5,
            kind="line", data=fmri.query("region == 'frontal'"));

plt.show()

In the following example, we want to compare the distribution of price from the diamonds dataset by color, so it makes sense to create density plots of the price distribution and stack them one below the next so we can visually compare them.

ordered_colors = ['E','F','G','H','I','J']
g = sns.FacetGrid(data = diamonds, row = 'color', height = 1.7, 
                  aspect = 4, row_order = ordered_colors)

g.map(sns.kdeplot, 'price');
plt.show()

With FacetGrid you can build a grid of univariate plots from any plotting function. Older seaborn releases had no figure-level function for univariate plots comparable to relplot for bivariate plots; seaborn 0.11 added displot, which fills that role.

Pairs plots

The pairs plot is a quick way to compare every pair of variables in a dataset (or at least every pair of continuous variables) in a grid. You can specify what kind of univariate plot will be displayed on the diagonal locations on the grid and which bivariate plots will be displayed on the off-diagonal locations.

sns.pairplot(data=iris);

plt.show()

You can achieve more customization using PairGrid.

g = sns.PairGrid(iris, diag_sharey=False);
g.map_upper(sns.scatterplot);
g.map_lower(sns.kdeplot, colors="C0");
g.map_diag(sns.kdeplot, lw=2);
plt.show()