Here is an example that combines much of what has been introduced within the course using a very practical application. You should view this step as a culminating project for the first six units of this course. You should master the material in this project before moving on to the units on data mining.
Introduction
There are a variety of data visualization libraries available in Python. There is a lot of commonality in these libraries, but they do take different approaches and offer somewhat different visualization tools.
A library called Matplotlib was developed in 2002. Matplotlib was designed to work with NumPy and SciPy, and it underlies many Python visualization packages, including the one we will learn, Seaborn. Seaborn is a high-level library; it is powerful and straightforward, which makes it a good place to start. However, if you continue to work with these kinds of tools, consider learning Matplotlib as well.
By convention, Seaborn is imported and given the abbreviation `sns`. When you see a call that is prefaced with `sns`, such as `sns.lineplot()`, you are using the Seaborn library.
Here are some image galleries you can take a quick look at to get a sense of what these packages can do:
* Seaborn gallery
* Matplotlib gallery
* Plotly gallery
We will only be scratching the surface of Seaborn, but once you get used to the basics, you should be able to start learning the rest for yourself.
We will be using calls to Matplotlib to tweak and display the plotting object we have built with Seaborn. In particular, we will be using a module within the library called `matplotlib.pyplot`. By convention, `matplotlib.pyplot` is imported as `plt`. When you see a function call prefaced with `plt`, such as `plt.show()`, you are using the Matplotlib library.
Preliminaries: Imports and Dataframe Creation
We are going to be using the CORGIS state crime dataset. Each row in the dataset represents one U.S. state in one year.
# imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# ignore this code entirely
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']

# read the data and create the dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme1 = 'data_corgis_state_crime.csv'
df_original = pd.read_csv(urlg + repo + fnme1)
df = df_original[df_original['State'] != 'United States']

# read the data and create a supplemental dataframe
fnme2 = 'data_state_info.csv'
df_stateinfo = pd.read_csv(urlg + repo + fnme2)

# remove totals and just keep crime rates
column_mask = ~df.columns.str.contains('Totals')
df = df[df.columns[column_mask]]

# drop rows with empty values
df = df.dropna(axis=0, how='any')

# create a decade column
df.loc[:, 'Decade'] = (df.loc[:, 'Year']//10 * 10).astype(int)

# merge state crime data with supplemental state classification data
df_stateinfo = df_stateinfo.reset_index()
df = pd.merge(df, df_stateinfo)

# normalize population for better plotting
df.loc[:, 'Population_Mil'] = df.loc[:, 'Population']/1000000

# check dataframe
df.head(3)
| | 0 | 1 | 2 |
---|---|---|---|
Population | 3266740 | 3302000 | 3358000 |
Rates.Property.All | 1035.4 | 985.5 | 1067.0 |
Rates.Property.Burglary | 355.9 | 339.3 | 349.1 |
Rates.Property.Larceny | 592.1 | 569.4 | 634.5 |
Rates.Property.Motor | 87.3 | 76.8 | 83.4 |
Rates.Violent.All | 186.6 | 168.5 | 157.3 |
Rates.Violent.Assault | 138.1 | 128.9 | 119.0 |
Rates.Violent.Murder | 12.4 | 12.9 | 9.4 |
Rates.Violent.Rape | 8.6 | 7.6 | 6.5 |
Rates.Violent.Robbery | 27.5 | 19.1 | 22.5 |
State | Alabama | Alabama | Alabama |
Year | 1960 | 1961 | 1962 |
Decade | 1960 | 1960 | 1960 |
index | 1 | 1 | 1 |
State Code | AL | AL | AL |
Region | South | South | South |
Division | East South Central | East South Central | East South Central |
Population_Mil | 3.26674 | 3.30200 | 3.35800 |
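The cleaning steps in the code above can be tried on a small scale. The sketch below repeats three of them (dropping the 'Totals' columns, building the decade column, and merging in state information) on hypothetical toy dataframes. The names `toy` and `info`, the column `Totals.Violent.All`, and every value are made up for illustration. It also uses `.copy()` and plain column assignment, which is one common way to write these steps and not necessarily how the chapter's code was meant to be written.

```python
import pandas as pd

# Hypothetical toy data standing in for the crime dataframe.
toy = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska"],
    "Year": [1969, 1970, 1985],
    "Rates.Violent.All": [100.0, 110.0, 90.0],
    "Totals.Violent.All": [3000, 3300, 450],
})

# Keep only the columns whose names do not contain 'Totals'.
column_mask = ~toy.columns.str.contains("Totals")
toy = toy[toy.columns[column_mask]].copy()

# Floor division drops the last digit of the year: 1969 // 10 * 10 gives 1960.
toy["Decade"] = (toy["Year"] // 10 * 10).astype(int)

# Hypothetical lookup table; with no 'on' argument, merge() joins on the
# column names the two dataframes share, which here is only 'State'.
info = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "Region": ["South", "West"],
})
merged = pd.merge(toy, info)

print(merged)
```

Each toy row picks up the `Region` of its state, which is the same way the real dataframe gains its `State Code`, `Region`, and `Division` columns.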
# abbreviated descriptives
df.describe().T[['mean', 'min', 'max']].round(1)
| | mean | min | max |
---|---|---|---|
Population | 4751877.0 | 226167.0 | 38041430.0 |
Rates.Property.All | 3683.2 | 573.1 | 9512.1 |
Rates.Property.Burglary | 929.5 | 182.6 | 2906.7 |
Rates.Property.Larceny | 2395.2 | 293.3 | 5833.8 |
Rates.Property.Motor | 358.5 | 48.3 | 1839.9 |
Rates.Violent.All | 398.9 | 9.5 | 2921.8 |
Rates.Violent.Assault | 235.6 | 3.6 | 1557.6 |
Rates.Violent.Murder | 6.7 | 0.2 | 80.6 |
Rates.Violent.Rape | 28.3 | 0.8 | 102.2 |
Rates.Violent.Robbery | 128.2 | 1.9 | 1635.1 |
Year | 1986.0 | 1960.0 | 2012.0 |
Decade | 1981.7 | 1960.0 | 2010.0 |
index | 25.0 | 0.0 | 50.0 |
Population_Mil | 4.8 | 0.2 | 38.0 |
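The descriptives call chains several operations, so it may help to see them one at a time. The sketch below runs the same chain on a hypothetical two-column dataframe (`toy`, with invented values) and stores the intermediate result.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration.
toy = pd.DataFrame({
    "Year": [1960, 1961, 1962],
    "Rate": [10.0, 20.0, 30.0],
})

# describe() returns one row per statistic and one column per numeric variable.
summary = toy.describe()

# .T flips the table to one row per variable; the list then selects
# three of the statistics, and round(1) trims the decimals.
short = summary.T[["mean", "min", "max"]].round(1)

print(short)
```

Note that `describe()` summarizes every numeric column, which is why columns such as `Year` and `index` appear in the table above even though their means are not very meaningful.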
Source: Mark Liffiton and Brad Sheese, https://snakebear.science/09-DataVisualization/index.html This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.