Here is an example that combines much of what has been introduced within the course using a very practical application. You should view this step as a culminating project for the first six units of this course. You should master the material in this project before moving on to the units on data mining.
Scatter Plots
Scatter Plots with Regression Lines
Regression is a technique for fitting a model to your data that can help you better understand how variables in your data are related. It can also be used to make predictions. Think of it as moving one step beyond just looking at correlations.
'Simple linear regression' is used to fit a straight line through the data. You use this technique when you think two variables have a simple linear (not curved) relation with one another.
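To make "fitting a straight line" concrete, here is a minimal sketch of the least-squares calculation in plain Python. The helper name `fit_line` and the sample points are made up for illustration; they are not part of the course dataset or of Seaborn.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = (sum of cross-deviations) / (sum of squared x-deviations)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


# made-up points that lie exactly on the line y = 2x + 1
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

With real data the points will not sit exactly on the line, and the fit is the line that minimizes the sum of squared vertical distances to the points.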
Seaborn includes visualizations of regression lines. Typically these show a basic scatter plot with the regression line added. Seaborn's regression visualization also includes a band around the line indicating the confidence interval.
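As a rough sketch of how the band can be controlled, `lmplot` accepts a `ci` argument: it defaults to 95 (a 95% confidence interval), and `ci=None` removes the band. The data frame `demo` and its columns `x` and `y` are synthetic stand-ins, not the crime data used in this unit.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# synthetic, roughly linear data for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
demo = pd.DataFrame({"x": x, "y": 3 * x + rng.normal(0, 4, 50)})

# default: regression line plus a 95% confidence band
g_default = sns.lmplot(x="x", y="y", data=demo)

# ci=None draws the regression line without the band
g_noband = sns.lmplot(x="x", y="y", data=demo, ci=None)

# call plt.show() (from matplotlib.pyplot) to display the figures,
# as in the other examples in this unit
```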
Note: If you are looking for formal regression statistics for your model, be aware that Seaborn's documentation states that its regression functionality is intended only to help with visualization, and that formal regression statistics should be computed with a different package, such as SciPy.
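For example, one way to get those statistics is `scipy.stats.linregress`. The sketch below uses a handful of made-up values rather than the crime dataset. In practice you would pass two columns of your data frame instead, such as `dfbig['Rates.Violent.All']` and `dfbig['Rates.Property.All']`.

```python
from scipy import stats

# made-up illustrative values, not taken from the crime dataset
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

result = stats.linregress(x, y)

print(f"slope     = {result.slope:.3f}")
print(f"intercept = {result.intercept:.3f}")
print(f"r-squared = {result.rvalue ** 2:.3f}")
print(f"p-value   = {result.pvalue:.3g}")
print(f"std err   = {result.stderr:.3f}")  # standard error of the slope
```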
Let's use regression visualizations to look at the following question: In large population states, is higher property crime associated with higher violent crime rates?
# scatter plot with regression line
sns.lmplot(x = 'Rates.Violent.All', y = 'Rates.Property.All', data = dfbig)
plt.show()
How about overall?
# scatter plot with regression line
sns.lmplot(x = 'Rates.Violent.All', y = 'Rates.Property.All', data = df_grouped_state_mean)
plt.show()
What about the ten states with the highest levels of violence?
# create list of highest violence states by sorting, then indexing the list,
# and then grabbing the state names from the index
highest_violence = df_grouped_state_mean \
    .loc[:, 'Rates.Violent.All'] \
    .sort_values()[-10:] \
    .index \
    .to_list()

# use high violence list to index df_grouped using .isin()
dfhighv = df_grouped_state_mean \
    .loc[df_grouped_state_mean \
    .index \
    .isin(highest_violence)]

# plot
sns.lmplot(x = "Population_Mil", y = "Rates.Violent.All", data = dfhighv)
plt.show()
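The sort-and-slice selection above can also be written with the pandas `nlargest` method. This is only a sketch of an alternative, not the course's method. The small data frame `demo` is a hypothetical stand-in for `df_grouped_state_mean`, with placeholder state names and made-up rates, and it picks the top three instead of the top ten.

```python
import pandas as pd

# hypothetical stand-in for df_grouped_state_mean: state names as the index
demo = pd.DataFrame(
    {"Rates.Violent.All": [250.0, 610.5, 430.2, 120.9, 505.7]},
    index=["A", "B", "C", "D", "E"],
)

# sort-and-slice, as in the course code (top 3 here instead of top 10)
by_sorting = demo.loc[:, "Rates.Violent.All"].sort_values()[-3:].index.to_list()

# the same selection using nlargest
by_nlargest = demo["Rates.Violent.All"].nlargest(3).index.to_list()

print(by_sorting)   # lowest of the three first
print(by_nlargest)  # highest of the three first
```

Both lists contain the same states, just in opposite order, so either one works with `.isin()`.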
These visualizations show a single outlier (notice the data point in the top left of the graph) that needs to be explored. We will come back to that soon.
Let's look at the association between property crime and population by region. Some Seaborn plotting functions, but not all, take an argument, col=, that allows us to create groups of plots. Earlier, we did this with a loop. Using col= and col_wrap=, which specifies how many plots you want per row, makes things a bit easier for us than writing a loop.
# scatter plots with regression lines in columns and rows
sns.lmplot(y = 'Population_Mil',
           x = 'Rates.Property.All',
           col = 'Region',  # creates a separate plot for each region
           col_wrap = 2,    # wraps columns so not all side by side
           data = df_grouped_state_mean)
plt.show()
The columns and rows of plots look nice, but the plots themselves are a bit ugly. Here, the scaling is consistent across the subplots, but most regions don't have data points across the full range of values, so the result is short, stunted-looking regression lines. Although we will lose the advantage of having common scaling between graphs, we can go back to our loop and make individual plots, so that Seaborn scales each plot individually:
# scatter plots with loop
for region in df['Region'].unique():  # loop through the regions
    # create a subset of the data with just one region at a time
    regiondata = df_grouped_state_mean \
        .loc[df_grouped_state_mean['Region'] == region]
    # make the plot
    sns.lmplot(y = 'Population_Mil', x = 'Rates.Property.All', data = regiondata)
    plt.title(region)
    plt.show()
That's a bit better. Now let's clean it up and provide labels and such.
# same as above, but with title and labels
for region in df['Region'].unique():
    regiondata = df_grouped_state_mean \
        .loc[df_grouped_state_mean['Region'] == region]
    sns.lmplot(y = 'Population_Mil', x = 'Rates.Property.All', data = regiondata)
    plt.ylabel('Population in Millions')
    plt.xlabel('Rate of Property Crime')
    plt.title(f'Region = {region}')
    plt.show()
This second approach makes for better individual charts to look at the regression lines, but the difference in scaling between them might create problems for folks who don't pay close attention to these things (which is most of us, most of the time). So there is a trade-off for plotting things this way. Be careful with these sorts of scaling issues, and consider your audience when making your visualizations.
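If you want the grid layout with independently scaled panels, one possible middle ground is to pass `sharex` and `sharey` through `facet_kws`. This is a sketch that assumes Seaborn 0.11.2 or newer (where `lmplot` accepts `facet_kws`). The data frame `demo` and its region names are synthetic placeholders, not the course data, and the same scaling caution applies to the result.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# synthetic data: each hypothetical region covers a different range of rates
rng = np.random.default_rng(1)
frames = []
for i, region in enumerate(["Northeast", "Midwest", "South", "West"]):
    rate = rng.uniform(1000 + 500 * i, 1500 + 500 * i, 12)
    pop = 5 + 0.004 * rate + rng.normal(0, 1, 12)
    frames.append(pd.DataFrame({
        "Region": region,
        "Rates.Property.All": rate,
        "Population_Mil": pop,
    }))
demo = pd.concat(frames, ignore_index=True)

# one panel per region, two per row, each with its own x and y scale
g = sns.lmplot(
    y="Population_Mil",
    x="Rates.Property.All",
    col="Region",
    col_wrap=2,
    facet_kws={"sharex": False, "sharey": False},  # do not share axes
    data=demo,
)
g.set_axis_labels("Rate of Property Crime", "Population in Millions")

# call plt.show() (from matplotlib.pyplot) to display the figure
```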