Scatter Plots

Seaborn's scatterplot() function can be used to make scatter plots of data.

# group by state, with mean aggregation
df_grouped_state_mean = df.groupby(by='State').mean()

# create list of the twenty most populous states
big_states = df_grouped_state_mean \
             .loc[:, 'Population'] \
             .sort_values()[-20:] \
             .index.to_list()

# use the big state list to filter df_grouped_state_mean
dfbig = df_grouped_state_mean \
        .loc[df_grouped_state_mean.index.isin(big_states)]

# create scatterplot
sns.scatterplot(x = "Population_Mil",
                y = "Rates.Violent.All",
                data = dfbig)
plt.show()
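
The sort, slice, and isin steps above can also be collapsed into a single call. The following is a minimal sketch, assuming a small made-up frame (the name df_demo and all of its values are hypothetical, not taken from the crime dataset). It uses pandas' nlargest, which returns the rows with the largest values in a column:

```python
import pandas as pd

# Made-up values for illustration only; not from the crime dataset
df_demo = pd.DataFrame(
    {"Population_Mil": [1.2, 8.5, 3.1, 20.4, 5.7],
     "Rates.Violent.All": [310.0, 420.0, 280.0, 510.0, 390.0]},
    index=["A", "B", "C", "D", "E"],
)

# nlargest keeps the n rows with the largest values in the given column,
# ordered from largest to smallest
df_top = df_demo.nlargest(3, "Population_Mil")
top_states = df_top.index.to_list()
print(top_states)
```

One difference from the isin approach: nlargest returns the rows sorted by population, largest first, while isin keeps the original row order. For a scatter plot the row order does not matter.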


Among the twenty largest states, there does not appear to be a simple linear association between population and violent crime rate. It would be helpful to know which state each point represents. Let's try adding the hue argument to see how that goes.

# create scatterplot with labelled data points
sns.scatterplot(x = 'Population_Mil',
                y = 'Rates.Violent.All',
                hue = dfbig.index,
                data = dfbig)
plt.show()


That's an all-new kind of terrible chart. The states appear in the legend in alphabetical order, but there are so many that it's impossible to tell one state's shade from another's. Let's try again with a smaller selection of states.

# create a shorter list of populous states
# (the [-8:-1] slice keeps seven states and skips the single largest)
big_six = df_grouped_state_mean  \
          .loc[:, 'Population'] \
          .sort_values()[-8:-1] \
          .index \
          .to_list()

# use the shorter list to filter df_grouped_state_mean
dfbig6 = df_grouped_state_mean \
         .loc[df_grouped_state_mean.index.isin(big_six)]

# create plot
sns.scatterplot(x = 'Population_Mil',
                y = 'Rates.Violent.All',
                hue = dfbig6.index,
                data = dfbig6)
plt.show()


An improvement, but still ugly. We can use matplotlib's plt.legend() function to control where the legend is displayed.

# create plot
sns.scatterplot(x = 'Population_Mil',
                y = 'Rates.Violent.All',
                hue = dfbig6.index,
                data = dfbig6)

# use matplotlib's plt.legend() to anchor the legend's upper-left corner
# just outside the right edge of the plot area
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0.)
plt.show()


That's a bit better. Now that the plot is cleaned up, let's look at what it says about our states.

Unfortunately for Florida, the plot shows that a state with a smaller population than New York or Texas can have a violent crime rate that matches or exceeds theirs.
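
With only a handful of points, another option is to skip the color legend and write each state's name next to its point. The following is a rough sketch of that idea, not the approach used in this section. The frame demo and its values are made up for illustration, and the Agg backend is selected only so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for illustration only; not from the crime dataset
demo = pd.DataFrame(
    {"Population_Mil": [19.0, 21.5, 18.2],
     "Rates.Violent.All": [450.0, 520.0, 700.0]},
    index=["State A", "State B", "State C"],
)

ax = sns.scatterplot(x="Population_Mil", y="Rates.Violent.All", data=demo)

# write each index label beside its point, nudged 5 points up and right
for name, row in demo.iterrows():
    ax.annotate(name,
                xy=(row["Population_Mil"], row["Rates.Violent.All"]),
                xytext=(5, 5),
                textcoords="offset points")

# collect the label text that was drawn on the axes
labels = [text.get_text() for text in ax.texts]
# in a notebook, call plt.show() here
```

Direct labels stop working once points crowd together, so this suits small selections like the one above rather than all states at once.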

We can examine the same association by region.

# add the state supplemental information into the grouped state mean df
df_grouped_state_mean = pd.merge(df_grouped_state_mean.reset_index(),
                                 df_stateinfo)

# plot the association with Region specified as the hue
sns.scatterplot(x = 'Population_Mil',
                y = 'Rates.Violent.All',
                hue = 'Region',
                data = df_grouped_state_mean)
plt.show()


That's not a particularly good visualization: too many brightly colored dots sit right next to one another.

Let's try breaking up the plots by region.

# draw one scatterplot per region
for region in df_grouped_state_mean.loc[:, 'Region'].unique():
    # keep only the states in this region
    df_region = df_grouped_state_mean \
                .loc[df_grouped_state_mean['Region'] == region]
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    data = df_region)
    plt.title(region)
    plt.show()
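
The loop above draws one separate figure per region. Seaborn can also lay the same panels out in a single grid with relplot(). The following is a sketch of that alternative under stated assumptions: the frame demo and its values are made up, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up values for illustration only; not from the crime dataset
demo = pd.DataFrame({
    "Region": ["Northeast", "Northeast", "Midwest", "Midwest",
               "South", "South", "West", "West"],
    "Population_Mil": [3.0, 12.0, 2.5, 9.0, 4.0, 15.0, 1.5, 25.0],
    "Rates.Violent.All": [250.0, 480.0, 220.0, 430.0,
                          520.0, 610.0, 300.0, 420.0],
})

# one panel per region, two panels per row; each panel gets its own axis
# ranges, which mimics drawing the plots one at a time in a loop
grid = sns.relplot(data=demo,
                   x="Population_Mil",
                   y="Rates.Violent.All",
                   col="Region",
                   col_wrap=2,
                   kind="scatter",
                   height=3,
                   facet_kws={"sharex": False, "sharey": False})

# title each panel with just the region name
grid.set_titles(col_template="{col_name}")
panel_titles = [ax.get_title() for ax in grid.axes.flat]
```

By default relplot() shares axis ranges across panels, which makes regions easier to compare but can squash regions whose states are all small. The facet_kws setting above turns that sharing off.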






So there does appear to be a positive association between population size and rates of violent crime in the Northeast and Midwest.
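
A visual impression like this can be backed up with a number. The following minimal sketch computes a Pearson correlation for each region. The frame demo and its values are made up for illustration, so the results say nothing about the real data:

```python
import pandas as pd

# Made-up values for illustration only; not from the crime dataset
demo = pd.DataFrame({
    "Region": ["Northeast"] * 4 + ["West"] * 4,
    "Population_Mil": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "Rates.Violent.All": [200.0, 300.0, 400.0, 500.0,
                          400.0, 380.0, 410.0, 390.0],
})

# correlation between population and violent crime rate within each region
corr_by_region = pd.Series({
    region: group["Population_Mil"].corr(group["Rates.Violent.All"])
    for region, group in demo.groupby("Region")
})
print(corr_by_region.round(2))
```

Pearson correlation only measures linear association and is sensitive to extreme points, so it complements the plots rather than replacing them.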

The plots for the South and the West both have outliers (data points that appear to be quite distant from the others). In the South, we have a relatively small state with a very high rate of violent crime. (It's not actually a state, despite having a population larger than several states, but we will come back to that.) In the West, we have a very large state (California) with a rate of violent crime very similar to that of much less populous states.
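
The outliers above were spotted by eye. One common rule of thumb for flagging them in code is the 1.5 x IQR rule: a point is flagged if it falls more than 1.5 interquartile ranges below the first quartile or above the third. This is a sketch of that rule on made-up values (the series rates is hypothetical), not the method used in this section:

```python
import pandas as pd

# Made-up rates for illustration only; not from the crime dataset
rates = pd.Series(
    [300.0, 320.0, 340.0, 360.0, 380.0, 400.0, 1500.0],
    index=["A", "B", "C", "D", "E", "F", "G"],
    name="Rates.Violent.All",
)

# first and third quartiles, and the interquartile range between them
q1 = rates.quantile(0.25)
q3 = rates.quantile(0.75)
iqr = q3 - q1

# flag values beyond 1.5 * IQR from either quartile
is_outlier = (rates < q1 - 1.5 * iqr) | (rates > q3 + 1.5 * iqr)
outliers = rates[is_outlier].index.to_list()
print(outliers)
```

The rule looks at one variable at a time. A point like California, which is extreme in population but ordinary in crime rate, would only be flagged if the rule were applied to the population column.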