Here is an example that combines much of what has been introduced within the course using a very practical application. You should view this step as a culminating project for the first six units of this course. You should master the material in this project before moving on to the units on data mining.
Scatter Plots
Seaborn's .scatterplot() method can be used to make scatterplots of data.

    # imports used throughout this section (skip if already loaded)
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # group by state, with mean aggregation
    df_grouped_state_mean = df.groupby(by='State').mean()

    # create list of largest population states
    # (the slice [-21:] keeps the last 21 entries of the ascending sort)
    big_states = df_grouped_state_mean \
        .loc[:, 'Population'] \
        .sort_values()[-21:] \
        .index.to_list()

    # use big state list to index df_grouped
    dfbig = df_grouped_state_mean \
        .loc[df_grouped_state_mean.index.isin(big_states)]

    # create scatterplot
    sns.scatterplot(x = "Population_Mil", y = "Rates.Violent.All", data = dfbig)
    plt.show()

Among the twenty largest states, there does not appear to be a simple linear association between population and rates of violent crime. It would be nice to know what state each point represents. Let's try adding the hue argument to see how that goes.

    # create scatterplot with labelled data points
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    hue = dfbig.index,
                    data = dfbig)
    plt.show()

That's an all-new kind of terrible chart that we've made. The States appear in the legend in alphabetical order, but there are so many it's impossible to tell one shade of State from another. Let's try it again with a smaller selection of states.

    # create list of largest population states
    # (the slice [-8:-1] keeps seven entries and skips the single largest)
    big_six = df_grouped_state_mean \
        .loc[:, 'Population'] \
        .sort_values()[-8:-1] \
        .index \
        .to_list()

    # use big state list to index df_grouped
    dfbig6 = df_grouped_state_mean \
        .loc[df_grouped_state_mean.index.isin(big_six)]

    # create plot
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    hue = dfbig6.index,
                    data = dfbig6)
    plt.show()

An improvement, but still ugly. We can use matplotlib's plt.legend() function to control where the legend is displayed.

    # create plot
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    hue = dfbig6.index,
                    data = dfbig6)

    # use a matplotlib method to control the legend display
    # (loc=2 means 'upper left'; bbox_to_anchor puts that corner just right of the axes)
    plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
    plt.show()

That's a bit better. Now that the plot is cleaned up, let's look at what it says about our States.
Florida shows that a state with a smaller population (relative to NY and TX) can have a violent crime rate that matches or exceeds the rates of bigger states.
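Another way to tell the points apart, sketched here as an alternative to the hue legend rather than as part of the course's own code, is to write each state's name next to its point with matplotlib's annotate(). The toy DataFrame below and all of its values are invented stand-ins for dfbig6; only the column names come from the code above.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for dfbig6: the state names and numbers are invented.
toy = pd.DataFrame(
    {
        "Population_Mil": [9.0, 12.0, 15.0, 19.0],
        "Rates.Violent.All": [450.0, 380.0, 700.0, 520.0],
    },
    index=pd.Index(["A", "B", "C", "D"], name="State"),
)

fig, ax = plt.subplots()
sns.scatterplot(x="Population_Mil", y="Rates.Violent.All", data=toy, ax=ax)

# Write each index label slightly up and to the right of its point.
labels = []
for state, row in toy.iterrows():
    label = ax.annotate(
        state,
        (row["Population_Mil"], row["Rates.Violent.All"]),
        xytext=(5, 5),
        textcoords="offset points",
    )
    labels.append(label)

plt.show()
```

Direct labels work well for a handful of points; with many states the text starts to overlap, much as the legend colors did.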
We can examine the same association by region.

    # add the state supplemental information into the grouped state mean df
    df_grouped_state_mean = pd.merge(df_grouped_state_mean.reset_index(),
                                     df_stateinfo)

    # plot the association with Region specified as the hue
    sns.scatterplot(x = 'Population_Mil',
                    y = 'Rates.Violent.All',
                    hue = 'Region',
                    data = df_grouped_state_mean)
    plt.show()

That's not a particularly good visualization. Too many brightly colored dots all next to one another.
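A note on the merge step used above: called without `on` or `how`, pd.merge() joins on the columns the two frames share and performs an inner join, so any row without a match in the second frame is dropped silently. The sketch below shows one way to check for that. It assumes df_stateinfo has a 'State' column alongside 'Region'; the toy frames and their values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins: "means" plays the role of df_grouped_state_mean,
# "info" the role of df_stateinfo. All names and values are invented.
means = pd.DataFrame(
    {"Population_Mil": [1.0, 2.0, 3.0]},
    index=pd.Index(["A", "B", "C"], name="State"),
)
info = pd.DataFrame({"State": ["A", "B"], "Region": ["West", "South"]})

# Default merge: inner join on the shared column "State"; row "C" is dropped.
inner = pd.merge(means.reset_index(), info)

# A left join with indicator=True keeps every row and records where it matched.
checked = pd.merge(means.reset_index(), info, how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "State"].to_list()

print(len(inner), "rows survive the inner join")
print("rows with no state info:", unmatched)
```

Running a check like this once before plotting makes it clear whether any entries disappeared during the merge.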
Let's try breaking up the plots by region.

    for region in df_grouped_state_mean.loc[:, 'Region'].unique():
        sns.scatterplot(x = 'Population_Mil',
                        y = 'Rates.Violent.All',
                        data = df_grouped_state_mean
                            .loc[df_grouped_state_mean['Region'] == region])
        plt.title(region)
        plt.show()

So there does appear to be a positive association between population size and rates of violent crime in the Northeast and Midwest.
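A visual impression like this can be backed up with a number. As a minimal sketch, assuming a frame shaped like the merged df_grouped_state_mean, the Pearson correlation can be computed separately for each region. The toy data below is invented, and only two regions are shown to keep it short.

```python
import pandas as pd

# Toy stand-in for the merged df_grouped_state_mean; all values are invented.
toy = pd.DataFrame(
    {
        "Region": ["Northeast"] * 3 + ["South"] * 3,
        "Population_Mil": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
        "Rates.Violent.All": [100.0, 200.0, 300.0, 300.0, 100.0, 200.0],
    }
)

# One Pearson correlation per region: positive values mean larger populations
# go with higher rates; values near 0 mean no clear linear trend.
r_by_region = pd.Series(
    {
        region: group["Population_Mil"].corr(group["Rates.Violent.All"])
        for region, group in toy.groupby("Region")
    }
)

print(r_by_region.round(2))
```

With only a few states per region, a single unusual point can dominate the coefficient, so it is best read alongside the plots rather than instead of them.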
The plots for the South and the West both have outliers (data points that appear to be quite distant from the others). In the South, we have a relatively small state with a very high rate of violent crime. (It's not actually a state, despite having a population larger than several states, but we will come back to that). In the West, we have a very large state (California), with a rate of crime very similar to much less populous states.
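The per-region loop above can also be written as a single faceted figure. The sketch below uses seaborn's relplot() with the col argument, which draws one panel per region on shared axes so the regions are easier to compare. It is offered as an alternative, not as the course's method, and the toy data is invented; only the column names match the merged frame.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the merged df_grouped_state_mean; all values are invented.
toy = pd.DataFrame(
    {
        "Region": ["Northeast", "Northeast", "Midwest", "Midwest",
                   "South", "South", "West", "West"],
        "Population_Mil": [3.0, 12.0, 4.0, 10.0, 2.0, 15.0, 1.5, 30.0],
        "Rates.Violent.All": [250.0, 480.0, 300.0, 520.0, 900.0, 600.0, 400.0, 450.0],
    }
)

# One scatter panel per region, wrapped into a 2 x 2 grid.
g = sns.relplot(
    data=toy,
    x="Population_Mil",
    y="Rates.Violent.All",
    col="Region",
    col_wrap=2,
    kind="scatter",
    height=3,
)
g.set_titles("{col_name}")  # show just the region name above each panel

plt.show()
```

Because the panels share their axis ranges by default, an outlier in one region stays visible as an outlier relative to the others.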