Scatter Plots

Scatter Plots with Regression Lines

Regression is a technique for fitting a model to your data that can help you better understand how variables in your data are related. It can also be used to make predictions. Think of it as moving one step beyond just looking at correlations.

Simple linear regression fits a straight line through the data. Use this technique when you think two variables have a linear (not curved) relationship with one another.
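To make "fitting a straight line" concrete, here is a minimal sketch of the least-squares calculation in plain Python. The numbers are made up for illustration and are not taken from the crime data, and fit_line is a hypothetical helper name, not something Seaborn provides.

```python
# Made-up illustrative values (not from the crime data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope = (how x and y vary together) / (how much x varies on its own)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # the fitted line always passes through the point (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(x, y)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The slope tells you how much y changes, on average, for each one-unit increase in x. That is the line a regression plot draws through the scatter.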

Seaborn includes visualizations of regression lines. Typically these show a basic scatter plot with the regression line added. Seaborn's regression visualization also includes a band around the line indicating the confidence interval.
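As a rough sketch of how the band can be controlled, lmplot takes a ci argument: a number sets the confidence level in percent (95 is the default), and None leaves the band off. The demo DataFrame below is a made-up stand-in that only borrows the column names used in this section.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up stand-in data; only the column names match this section
demo = pd.DataFrame({
    'Rates.Violent.All': [210, 340, 400, 480, 550, 620],
    'Rates.Property.All': [1900, 2500, 2400, 3100, 3300, 3900],
})

# regression line with the default 95% confidence band
g_band = sns.lmplot(x = 'Rates.Violent.All',
                    y = 'Rates.Property.All',
                    data = demo,
                    ci = 95)

# regression line only, no band
g_line = sns.lmplot(x = 'Rates.Violent.All',
                    y = 'Rates.Property.All',
                    data = demo,
                    ci = None)
plt.show()
```

A wide band means the position of the line is uncertain, which is common when there are only a few data points.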

Note: Seaborn's documentation states that its regression functionality is intended only as a visual aid. If you need formal regression statistics for your model, compute them with a different package, such as statsmodels or SciPy.
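For example, SciPy's stats.linregress returns the slope, intercept, correlation coefficient, p-value, and standard error of a simple linear regression. This is only a sketch: the two lists are made-up values, not columns from the crime data.

```python
from scipy import stats

# Made-up illustrative values (not from the crime data set)
violent = [1.0, 2.0, 3.0, 4.0, 5.0]
property_ = [2.1, 3.9, 6.2, 7.8, 10.1]

# fit property_ = slope * violent + intercept
result = stats.linregress(violent, property_)

print(f"slope:     {result.slope:.3f}")
print(f"intercept: {result.intercept:.3f}")
print(f"r:         {result.rvalue:.3f}")
print(f"p-value:   {result.pvalue:.4f}")
print(f"std err:   {result.stderr:.3f}")
```

With a real DataFrame, you would pass two of its columns in place of the lists.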

Let's use regression visualizations to look at the following question: In states with large populations, is a higher property crime rate associated with a higher violent crime rate?

# scatter plot with regression line
sns.lmplot(x = 'Rates.Violent.All',
           y = 'Rates.Property.All',
           data = dfbig)
plt.show()


How about overall?

# scatter plot with regression line
sns.lmplot(x = 'Rates.Violent.All',
           y = 'Rates.Property.All',
           data = df_grouped_state_mean)
plt.show()


What about the ten states with the highest levels of violence?

# create list of highest violence states by sorting, then indexing the list,
# and then grabbing the state names from the index
highest_violence = df_grouped_state_mean \
                   .loc[:, 'Rates.Violent.All'] \
                   .sort_values()[-10:] \
                   .index \
                   .to_list()

# use the high violence list to filter df_grouped_state_mean with .isin()
dfhighv = df_grouped_state_mean \
          .loc[df_grouped_state_mean \
               .index \
               .isin(highest_violence)]

# plot
sns.lmplot(x = "Population_Mil",
           y = "Rates.Violent.All",
           data = dfhighv)
plt.show()
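The sort-then-slice selection used above can also be written more compactly. As a sketch, pandas' nlargest returns the rows with the largest values in a column directly. The small demo DataFrame here is made up, with letters standing in for state names in the index.

```python
import pandas as pd

# Made-up stand-in for df_grouped_state_mean, indexed by "state"
demo = pd.DataFrame(
    {'Rates.Violent.All': [250.0, 610.5, 430.2, 120.9, 505.0]},
    index = ['A', 'B', 'C', 'D', 'E'],
)
n = 3  # the example above uses 10

# approach used above: sort ascending, keep the last n index labels
by_sorting = demo \
             .loc[:, 'Rates.Violent.All'] \
             .sort_values()[-n:] \
             .index \
             .to_list()

# compact alternative: nlargest returns the n rows with the highest values
top_rows = demo.nlargest(n, 'Rates.Violent.All')

print(sorted(by_sorting))
print(sorted(top_rows.index))
```

Both approaches pick the same states. nlargest returns them from highest to lowest, while the sorted slice lists them from lowest to highest.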


These visualizations show a single outlier (notice the data point in the top left of the graph) that needs to be explored. We will come back to that soon.

Let's look at the association between property crime and population by region. Some Seaborn plotting functions, but not all, take a col argument that creates a separate plot for each value of a column. Earlier, we did this with a loop. Combining col with col_wrap, which sets how many plots appear per row, is a bit easier than writing a loop.

# scatter plots with regression lines in columns and rows
sns.lmplot(y = 'Population_Mil',
           x = 'Rates.Property.All',
           col = 'Region', # creates a separate plot for each region
           col_wrap = 2,  # wraps columns so not all side by side
           data = df_grouped_state_mean)
plt.show()


The grid of plots looks nice, but the plots themselves are a bit ugly. The scaling is consistent across the subplots, but most regions don't have data points across the full range of values, so the result is short, stunted-looking regression lines. Although we will lose the advantage of common scaling between graphs, we can go back to our loop and make individual plots so that Seaborn scales each one separately:

# scatter plots with loop
for region in df_grouped_state_mean['Region'].unique(): # loop through the regions
  # create a subset of the data with just one region at a time
  regiondata = df_grouped_state_mean \
               .loc[df_grouped_state_mean['Region'] == region]

  # make the plot
  sns.lmplot(y = 'Population_Mil',
             x = 'Rates.Property.All',
             data = regiondata)
  plt.title(region)
  plt.show()





That's a bit better. Now let's clean it up by adding axis labels and a clearer title.

# same as above, but with title and labels
for region in df_grouped_state_mean['Region'].unique():
  regiondata = df_grouped_state_mean \
               .loc[df_grouped_state_mean['Region'] == region]
  sns.lmplot(y = 'Population_Mil',
             x = 'Rates.Property.All',
             data = regiondata)
  plt.ylabel('Population in Millions')
  plt.xlabel('Rate of Property Crime')
  plt.title(f'Region = {region}')
  plt.show()





This second approach produces individual charts that show the regression lines more clearly, but the difference in scaling between them might create problems for folks who don't pay close attention to these things (which is most of us, most of the time). So there is a trade-off in plotting things this way. Be careful with these sorts of scaling issues, and consider your audience when making your visualizations.
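If you want independently scaled plots without writing a loop, here is a sketch of one more option. It assumes a recent Seaborn version in which lmplot accepts a facet_kws dictionary, and the demo DataFrame is made up (only the column names match this section). Setting sharex and sharey to False lets each subplot choose its own axis limits, so the same trade-off about scaling applies.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up stand-in data; only the column names match this section
demo = pd.DataFrame({
    'Region': ['West'] * 4 + ['South'] * 4,
    'Rates.Property.All': [1500, 1800, 2100, 2600, 3000, 3300, 3700, 4100],
    'Population_Mil': [0.6, 1.1, 1.9, 3.0, 6.5, 9.8, 12.4, 21.0],
})

# one subplot per region, each with its own x and y scale
g = sns.lmplot(y = 'Population_Mil',
               x = 'Rates.Property.All',
               col = 'Region',
               col_wrap = 2,
               data = demo,
               facet_kws = {'sharex': False, 'sharey': False})
g.set_axis_labels('Rate of Property Crime', 'Population in Millions')
plt.show()
```

Because the tick labels now differ from one subplot to the next, it is worth telling your audience that the axes are not directly comparable.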