Visualizing with seaborn

Here is an example that combines much of what has been introduced within the course using a very practical application. You should view this step as a culminating project for the first six units of this course. You should master the material in this project before moving on to the units on data mining.

Introduction

There are a variety of data visualization libraries available in Python. There is a lot of commonality in these libraries, but they do take different approaches and offer somewhat different visualization tools.

A library called Matplotlib was developed in 2002. Matplotlib is designed to work with NumPy and SciPy, and it underlies many Python visualization packages, including the one we will learn, Seaborn. Seaborn is a high-level library built on top of Matplotlib; it is powerful yet straightforward, which makes it a good place to start. However, if you continue to work with these kinds of tools, consider learning Matplotlib as well.

By convention, Seaborn is imported and given the abbreviation sns. When you see a call that is prefaced with sns, such as sns.lineplot(), you are using the Seaborn library.

Here are some image galleries you can browse to get a sense of what these packages can do:

* Seaborn gallery
* Matplotlib gallery
* Plotly gallery

We will only be scratching the surface of Seaborn, but once you get used to the basics, you should be able to start learning the rest for yourself.

We will be using calls to Matplotlib to tweak and display the plotting object we have built with Seaborn. In particular, we will be using a module within the library called matplotlib.pyplot. By convention, matplotlib.pyplot is imported as plt. When you see a function call prefaced with plt, such as plt.show(), you are using the Matplotlib library.


Preliminaries: Imports and Dataframe Creation

We are going to be using the CORGIS state crime dataset. Each row in the dataset represents one U.S. state in one year.

# imports
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# notebook display settings (Jupyter magics); you can ignore these lines
%matplotlib inline
%config InlineBackend.figure_formats = ['svg']

# read the data and create the dataframe
urlg = 'https://raw.githubusercontent.com/'
repo = 'bsheese/CSDS125ExampleData/master/'
fnme1 = 'data_corgis_state_crime.csv'
df_original = pd.read_csv(urlg + repo + fnme1)
df = df_original[df_original['State'] != 'United States']

# read the data and create a supplemental dataframe
fnme2 = 'data_state_info.csv'
df_stateinfo = pd.read_csv(urlg + repo + fnme2)

# remove totals and just keep crime rates
column_mask = ~df.columns.str.contains('Totals')
df = df[df.columns[column_mask]]

# drop rows with missing values
df = df.dropna(axis=0, how='any')

# create a decade column
df.loc[:, 'Decade'] = (df.loc[:, 'Year']//10 * 10).astype(int)

# merge state crime data with supplemental state classification data
df_stateinfo = df_stateinfo.reset_index()
df = pd.merge(df, df_stateinfo)

# rescale population to millions for more readable plots
df.loc[:, 'Population_Mil'] = df.loc[:, 'Population']/1000000

# check dataframe (transposed so that every column is visible)
df.head(3).T


                         0                   1                   2
Population               3266740             3302000             3358000
Rates.Property.All       1035.4              985.5               1067.0
Rates.Property.Burglary  355.9               339.3               349.1
Rates.Property.Larceny   592.1               569.4               634.5
Rates.Property.Motor     87.3                76.8                83.4
Rates.Violent.All        186.6               168.5               157.3
Rates.Violent.Assault    138.1               128.9               119.0
Rates.Violent.Murder     12.4                12.9                9.4
Rates.Violent.Rape       8.6                 7.6                 6.5
Rates.Violent.Robbery    27.5                19.1                22.5
State                    Alabama             Alabama             Alabama
Year                     1960                1961                1962
Decade                   1960                1960                1960
index                    1                   1                   1
State Code               AL                  AL                  AL
Region                   South               South               South
Division                 East South Central  East South Central  East South Central
Population_Mil           3.26674             3.30200             3.35800
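The row filter, column mask, decade calculation, and merge used in the preparation code can be tried on a toy dataframe to see what each step does. This is only a sketch: the values, the Totals.Violent.All column name, and the small info frame are invented for illustration and do not come from the course files.

```python
import pandas as pd

# toy stand-in for the crime data; all values are invented
toy = pd.DataFrame({'State': ['Alabama', 'Alabama', 'United States'],
                    'Year': [1968, 1971, 1971],
                    'Rates.Violent.All': [150.0, 160.0, 396.0],
                    'Totals.Violent.All': [5000, 5500, 816500]})

# keep only the rows for individual states
toy = toy[toy['State'] != 'United States']

# ~ inverts the boolean mask, so columns whose names contain 'Totals' are dropped
column_mask = ~toy.columns.str.contains('Totals')
toy = toy[toy.columns[column_mask]].copy()

# floor division by 10, then multiplication by 10, rounds a year down to its decade
toy['Decade'] = (toy['Year'] // 10 * 10).astype(int)

# with no `on` argument, merge joins on the column names both frames share (here 'State')
info = pd.DataFrame({'State': ['Alabama'], 'Region': ['South']})
toy = pd.merge(toy, info)

print(toy)
```

Working through a frame this small makes it easy to check each intermediate result by eye before trusting the same steps on the full dataset.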

# abbreviated descriptives
df.describe().T[['mean', 'min', 'max']].round(1)

                         mean        min         max
Population               4751877.0   226167.0    38041430.0
Rates.Property.All       3683.2      573.1       9512.1
Rates.Property.Burglary  929.5       182.6       2906.7
Rates.Property.Larceny   2395.2      293.3       5833.8
Rates.Property.Motor     358.5       48.3        1839.9
Rates.Violent.All        398.9       9.5         2921.8
Rates.Violent.Assault    235.6       3.6         1557.6
Rates.Violent.Murder     6.7         0.2         80.6
Rates.Violent.Rape       28.3        0.8         102.2
Rates.Violent.Robbery    128.2       1.9         1635.1
Year                     1986.0      1960.0      2012.0
Decade                   1981.7      1960.0      2010.0
index                    25.0        0.0         50.0
Population_Mil           4.8         0.2         38.0
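The chained call that produced the summary above can be unpacked one step at a time. The sketch below uses a toy dataframe with invented values, and the intermediate names full, flipped, and summary are only for illustration.

```python
import pandas as pd

# toy data, invented for this illustration only
toy = pd.DataFrame({'Population': [100, 200, 300],
                    'Rate': [1.2, 2.4, 3.6]})

# describe() returns one row per statistic and one column per variable
full = toy.describe()

# .T transposes, giving one row per variable and one column per statistic
flipped = full.T

# keep three of the statistics and round to one decimal place
summary = flipped[['mean', 'min', 'max']].round(1)

print(summary)
```

Transposing first is what lets the statistics be selected as columns, which reads more easily when there are many variables.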


Source: Mark Liffiton and Brad Sheese, https://snakebear.science/09-DataVisualization/index.html
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 License.