CS250 Study Guide

Unit 6: Visualization

6a. Apply seaborn commands to visualize pandas dataframe data

  • What is the seaborn module, and how is it used?
  • What are some useful plotting categories to be aware of?
  • What are important input parameter choices to be aware of?

The seaborn module is a visualization package built on top of matplotlib and designed for data science applications. It extends matplotlib's capabilities; for example, the set_theme function restyles the matplotlib plotting environment. You can use the seaborn module to render data in many ways, including relational plots, categorical plots, distribution plots, multi-plot grids, and matrix plots.

Seaborn groups its plotting routines into specific categories. For example, a relational plot can be expressed as a line plot or a scatter plot. You can call the associated relational plotting methods (lineplot, scatterplot) individually, or you can use the relplot method and set input parameters such as kind to the desired value. Distribution plots are used to plot histograms, marginal distributions, empirical cumulative distributions, and kernel density estimates, either by calling the individual methods or by configuring the displot method. When applying the histplot method, you should be familiar with input parameters such as multiple, which controls how multiple histograms are drawn together, and kde, which superimposes a kernel density estimate. Categorical plots are useful for creating bar plots, violin plots, box plots, swarm plots, and count plots, and you should be familiar with the syntax for creating these plots using the catplot method. You should also be familiar with the default mode of these methods: relplot draws a scatter plot, displot draws a histogram, and catplot draws a strip plot unless kind is set otherwise.

Finally, you should be aware of the subtle differences between pairplot, FacetGrid, and displot, and of when to apply each. An input parameter such as margin_titles can be applied with FacetGrid, which is useful for visualizing conditional histograms of three-dimensional data in two dimensions: two variables define the rows and columns of the grid, and the third is plotted in each panel. The pairplot method compares variables two at a time and does not have a margin_titles option.

To review, see The seaborn Module.

 

6b. Apply advanced data visualization techniques 

  • How is the estimator chosen for statistical visualization techniques?
  • What parameter can be used to specify and visualize the confidence interval?
  • What options are available to stratify and order categorical data?

As you work more deeply with the seaborn module, you will find that many parameters are available to fine-tune your visualizations and to control statistical measures. For example, methods such as barplot allow you to choose the estimator; the default is the mean, but the median can be an equally valid choice. Many seaborn methods allow you to include the confidence interval by setting the ci parameter (seaborn 0.12 and later deprecate ci in barplot, lineplot, and similar methods in favor of the errorbar parameter). Adding a confidence interval to line plots or changing its level in bar plots can reveal much about a given data set. Additionally, by this point, you should have a strong command of the input parameters for the displot method. You should also be aware that PairGrid allows you to control the scaling of the diagonal plots using the diag_sharey input parameter, which determines whether the diagonal histograms share a y-axis, so you can balance the scale of the histogram heights.

The seaborn module offers a great deal of flexibility when it comes to rendering categorical data. Categorical plots can stratify the data by the categories of a specific variable when you set the hue parameter. It is also important that you understand the goal of a box plot versus, for example, a violin plot: a box plot summarizes quartile ranges and outliers, while a violin plot adds a kernel density estimate of the distribution. The use of countplot should also be a part of your inventory for rendering categorical data, and you should have a strong command of this method's input parameters. For example, you can control the ordering of the categories using the order parameter. The syntax of categorical plot parameters is quite consistent among the various methods for rendering data.
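A short sketch of these categorical options, using a made-up DataFrame (the size, kind, and price columns are hypothetical): countplot with an explicit order, a box plot stratified by hue, and a violin plot of the same data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so no window is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data: 5 small, 4 medium, and 3 large items.
df = pd.DataFrame({
    "size": ["small"] * 5 + ["medium"] * 4 + ["large"] * 3,
    "kind": ["new", "used", "new", "used", "new",
             "used", "new", "used", "new",
             "used", "new", "used"],
    "price": [3.0, 2.5, 4.0, 2.0, 3.5,
              6.0, 7.5, 5.5, 8.0,
              12.0, 15.0, 11.0],
})

order = ["small", "medium", "large"]  # explicit left-to-right category order
fig, (ax_count, ax_box, ax_violin) = plt.subplots(1, 3, figsize=(12, 4))

# Count of rows per category, drawn in the requested order.
sns.countplot(data=df, x="size", order=order, ax=ax_count)

# Box plot: quartiles and outliers, split further by the hue variable.
sns.boxplot(data=df, x="size", y="price", hue="kind", order=order, ax=ax_box)

# Violin plot: same data, with a kernel density estimate per category.
sns.violinplot(data=df, x="size", y="price", order=order, ax=ax_violin)

# Read the bar heights back from left to right.
counts = [int(p.get_height())
          for p in sorted(ax_count.patches, key=lambda p: p.get_x())]
```

Note that the same x, y, hue, and order arguments carry over between the three calls, which reflects the consistent syntax mentioned above.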

To review, see Advanced Data Visualization Techniques.

 

6c. Apply the seaborn module to solve data science problems 

  • What is a matrix plot, and how can it be used to infer correlations with a data set?
  • How are joint plots useful for inferring data classification strategies?
  • How is seaborn used to visualize regressions?

In data science, knowing the syntax is half the battle; the other half is knowing how to apply problem-solving methods. In certain applications, a picture can truly be worth a thousand words. Plots of matrix data using a matrix plot such as heatmap can be extremely useful for visualizing empirical distributions of data. Furthermore, a correlation matrix plotted as a heatmap lets you see at a glance whether each pair of variables is positively correlated, uncorrelated, or negatively correlated. Using a tool in this manner requires combining knowledge of statistics with advanced visualization techniques.
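A minimal sketch of a correlation heatmap is shown below. The data is synthetic and the column names (x, up, down, noise) are hypothetical: one column rises with x, one falls with x, and one is unrelated noise.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: "up" rises with x, "down" falls with x, "noise" is unrelated.
rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
df = pd.DataFrame({
    "x": x,
    "up": 2 * x + rng.normal(0, 3, 50),
    "down": -x + rng.normal(0, 3, 50),
    "noise": rng.normal(0, 1, 50),
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Fix the color scale to [-1, 1] and center it at 0 so that positive,
# near-zero, and negative correlations map to distinct colors.
fig, ax = plt.subplots()
sns.heatmap(corr, vmin=-1, vmax=1, center=0, cmap="coolwarm",
            annot=True, fmt=".2f", ax=ax)
```

Pinning vmin, vmax, and center is a design choice: it keeps the colors comparable across different data sets instead of rescaling to each matrix.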

The jointplot method is useful for visualizing joint empirical data and will show you the marginal distributions of each variable. If the data includes a categorical variable, clusters can be further stratified using the hue input parameter. Combining joint, marginal, and stratified views of the data can give an immediate picture and roadmap for designing a data classification scheme. Finally, you should be aware of the input parameter choices and subtle differences in allowed input data for plotting regressions with either lmplot or regplot. For example, both allow pandas DataFrames as input, but only regplot accepts NumPy arrays as input.

To review, see Advanced Data Visualization Techniques.

 

Unit 6 Vocabulary

This vocabulary list includes terms you will need to know to successfully complete the final exam.

  • barplot
  • category plot
  • ci
  • countplot
  • displot
  • distribution plot
  • estimator
  • FacetGrid
  • heatmap
  • hue
  • jointplot
  • lmplot
  • margin_titles
  • matrix plot
  • regplot
  • relational plot