Data Visualization in Python

At this point in the course, it is time to begin connecting the dots and applying visualization to your knowledge of statistics. Work through these programming examples to round out your knowledge of seaborn as it is applied to univariate and bivariate plots.

Introduction

Data visualization is a core task in exploring and understanding data. Humans are primarily visual creatures, so a good visualization helps communicate the story within the data. Data and the processes that generate them are often complex; a visual representation, combined with our innate ability for pattern recognition, can reveal that complexity in a cognitively accessible way.


An example gallery

Data visualization has a long and storied history. Florence Nightingale was a pioneer of the field and developed the rose plot to represent causes of death in hospitals during the Crimean War.

John Snow's visualization of the cholera outbreak in London 1854

In 1854, John Snow famously mapped the cholera outbreak in London, showing that cholera cases clustered geographically around particular water wells.


In one of the more famous visualizations, considered by many to be an optimal use of display ink and space, Minard depicted Napoleon's disastrous campaign against Russia.


In more recent times, an employee at Facebook visualized the connections between users across the world. The resulting image clearly reflected the geography of particular countries and regions.


Why visualize data?

We often rely on numerical summaries to understand and distinguish datasets. In 1973, Anscombe published an influential set of four datasets, each with two variables, whose means, variances, and correlations are nearly identical. When the data are graphed, however, the differences between the datasets are clearly visible. This set is popularly known as Anscombe's quartet.
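The point can be checked numerically with a minimal sketch that uses only the Python standard library. The values are Anscombe's data as commonly reproduced (for example, in seaborn's "anscombe" example dataset). The names QUARTET, summarize, and plot_quartet are illustrative and not part of the course material, and the plotting helper assumes matplotlib is installed.

```python
from math import sqrt
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
X_COMMON = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_COMMON,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_COMMON,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_COMMON,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def summarize(x, y):
    """Means, sample variances, and correlation for one dataset."""
    return {
        "mean_x": mean(x),
        "mean_y": mean(y),
        "var_x": variance(x),
        "var_y": variance(y),
        "corr": correlation(x, y),
    }


SUMMARIES = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}


def plot_quartet():
    """Scatter plot of each dataset (assumes matplotlib is available)."""
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
    for ax, (name, (x, y)) in zip(axes.flat, QUARTET.items()):
        ax.scatter(x, y)
        ax.set_title(f"Dataset {name}")
    fig.tight_layout()
    plt.show()


if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

Printing the summaries shows four nearly identical rows, while calling plot_quartet() shows four visibly different scatter plots.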


A more recent experiment in data construction by Matejka and Fitzmaurice (2017) started with a scatter plot in the shape of a dinosaur and created 12 more bivariate datasets, all of which share nearly the same means and variances for each variable and nearly the same correlation between the two variables.


These examples clarify the need for visualization to better understand relationships between variables.

Even when using statistical visualization techniques, one has to be careful: not every type of plot can distinguish between datasets with different statistical characteristics. Matejka and Fitzmaurice explored this as well.

Strip plot


Boxplot


Violin plot
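A small sketch of why the choice of plot matters: the three made-up samples below have different shapes but the same five-number summary, so a boxplot (which draws only that summary, plus any outliers) should look the same for all three, while a strip plot or violin plot shows the differences. The sample values and the names SAMPLES, five_number_summary, and compare_plots are hypothetical, chosen for illustration. Quartiles use the linear-interpolation ("inclusive") convention, and the plotting helper assumes seaborn and matplotlib are installed.

```python
from statistics import quantiles

# Three illustrative samples with different shapes (made-up values).
SAMPLES = {
    "even": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "clumped": [0, 2, 2, 2, 4, 6, 6, 6, 8],
    "mixed": [0, 0, 2, 3.9, 4, 4.1, 6, 8, 8],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using linear interpolation."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


FIVE_NUMBER = {name: five_number_summary(v) for name, v in SAMPLES.items()}


def compare_plots():
    """Draw strip, box, and violin plots side by side (assumes seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Flatten the samples into parallel label/value vectors.
    labels = [name for name, v in SAMPLES.items() for _ in v]
    values = [x for v in SAMPLES.values() for x in v]

    fig, axes = plt.subplots(1, 3, sharey=True, figsize=(10, 3))
    sns.stripplot(x=labels, y=values, ax=axes[0])
    sns.boxplot(x=labels, y=values, ax=axes[1])
    sns.violinplot(x=labels, y=values, ax=axes[2])
    for ax, title in zip(axes, ["Strip plot", "Boxplot", "Violin plot"]):
        ax.set_title(title)
    fig.tight_layout()
    plt.show()


if __name__ == "__main__":
    for name, summary in FIVE_NUMBER.items():
        print(name, summary)
```

All three summaries come out the same, which is why the boxplot panel cannot tell the samples apart; calling compare_plots() shows the three plot types side by side.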



Conceptual ideas

Begin with the consumer in mind

  • You have a deep understanding of the data you're presenting
  • The person seeing the visualization doesn't
  • Develop simpler visualizations first that are easier to explain


Tell a story

  • Make sure the graphic is clear
  • Make sure the main point you want to make “pops”


A matter of perception

  • Color (including awareness of color deficiencies)
  • Shape
  • Fonts


Some principles

  1. Data-ink ratio
  2. No mental gymnastics
    1. The graphic should be self-evident
    2. The context should be clear
  3. Is a graph the wrong choice?
  4. Focus on the consumer


Source: Abhijit Dasgupta, https://www.araastat.com/BIOF085/data-visualization-using-python.html
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 License.