Storytelling with Data and GGPlot

The most effective storytelling helps the audience reach the right conclusion and take the appropriate action. This article provides an example of how data is used to tell a story about reducing global sea ice. Do you believe the author was successful in telling their story? What would you have done differently?

Good data science has to be aesthetic to ensure that the person who consumes your data product draws the right conclusion. Storytelling with data is a craft where mathematics and aesthetics meet. For data science to provide value, it has to be sound, useful and aesthetic. The aesthetics of data science refers to the way the results are communicated. This article discusses an informative but messy graph about the extent of global sea ice. The second part describes how to improve it with the ggplot2 library.

Many data scientists discuss at length which computing language they should learn. I claim that English (or whatever other language you speak) is the most essential language an analyst can learn, together with the language of art. A good visualisation is a piece of data art that is composed to achieve a purpose. Whenever somebody looks at your visualisation, you want them to reach the same conclusion as you, and they should be able to do so without having to dissect the information.

I subscribe to both the Data is Beautiful and Data is Ugly subreddits. While beauty might be in the eye of the beholder, aesthetic data science has to follow some rules. I subscribe to the concept that you need to maximise the data-pixel ratio of your graphs. In other words, each pixel in your chart should ideally form part of the story: no superfluous elements, and colour only where it is part of the story. The type of chart also matters, as each geometry tells a different story. Let's put this into practice with some ggplot.

This post has been updated based on feedback from Clark Richards as the original used the wrong data set.


Storytelling with data case study: Receding sea ice sheets

A little while ago, one of the posts on the Data is Beautiful subreddit showed a disturbing graph about the changes in the extent of global sea ice from 1978 till recently. The chart is created by an anonymous citizen data scientist who calls him or herself Wipneus, which is Dutch for a snub nose and also a character in an old Dutch comic.

This graph contains all the necessary information to come to the conclusion that the total surface area of global sea ice has reduced substantially over the past forty years. This conclusion does, however, require quite a bit of squinting and close examination of the graph.

Global Sea Ice Area 1978–2020.

This graph is a typical multivariate time series where the colour of each line indicates a category. While this approach might work fine for one or two lines, the cacophony of colours makes it hard to distinguish which line belongs to which year. This graph would be impossible to interpret for the eight per cent of men who are colour blind. Also, the story this graph tells is confusing. While the story seems to be that sea ice has been receding over the past forty years, what the chart mainly shows is the seasonal variation.
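One way to soften the colour problem while keeping the original geometry is to map the year to a continuous, colour-blind-friendly scale instead of a discrete colour per year. The sketch below is only an illustration: `toy_ice` is a made-up data frame with invented values, not the NSIDC data, and the viridis scale is my assumption rather than anything used in the original chart. It addresses the palette only, not the choice of independent variable discussed in the next section.

```r
library(ggplot2)

# Hypothetical toy data: a seasonal cycle with a slow downward drift.
# The numbers are invented for illustration and are not real measurements.
toy_ice <- expand.grid(yearday = 1:365, year = 1979:2020)
toy_ice$area <- 18 + 3 * sin(2 * pi * toy_ice$yearday / 365) -
  0.03 * (toy_ice$year - 1979)

# One line per year, with the year mapped to a continuous viridis scale
# rather than forty-odd unrelated discrete colours.
p_viridis <- ggplot(toy_ice,
                    aes(yearday, area, group = year, colour = year)) +
  geom_line(alpha = 0.6) +
  scale_colour_viridis_c(name = "Year") +
  labs(x = "Day of the year", y = "Sea ice area (toy values)") +
  theme_minimal()
```

Printing `p_viridis` draws the chart. Older and newer years now sit at opposite ends of one ordered gradient, so the direction of change is readable even without matching individual lines to a legend.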


Storytelling with data and ggplot

To redesign this graph, we first need to clearly define the story we are telling with this data. The original chart shows that the area of global sea ice has receded over the past forty years. This means that our dependent variable (y-axis) is the surface area, and the independent variable (x-axis) is the year.

The original graph shows the month as the independent variable. The sea ice sheets grow and shrink with the seasons, as the sinusoidal shape in the chart indicates. While this is an interesting pattern, it is not the story we want to tell with this data. To show the change over time, the year should be the independent variable, perhaps with a colour for each month. To avoid a palette with twelve colours, it is probably better to use a grid with one panel for each month. The graph below is my proposed improved version.

The ribbon indicates the maximum and minimum extent of ice for each month over the years. For many months, the maximum amount of ice in recent years is about the same as the minimum amount at the start of this time series. The dark blue line is the linear regression of the mean sea ice area, which clearly indicates the downward trend.

Global Sea Ice Area 1978–2020 for each month of the year.
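The regression line in each panel is an ordinary least-squares fit of mean area against year. The same fit can be computed outside the plot to put a number on the trend. This is a minimal sketch for a single month, assuming a made-up data frame `toy_month` with invented values (not the NSIDC data):

```r
# Hypothetical yearly means for one month: a downward drift plus a small
# deterministic wiggle. The values are invented for illustration.
toy_month <- data.frame(year = 1979:2020)
toy_month$mean_area <- 20 - 0.03 * (toy_month$year - 1979) +
  0.2 * sin(toy_month$year)

# The same kind of model geom_smooth(method = "lm") fits in each panel
fit <- lm(mean_area ~ year, data = toy_month)

# Slope in area units per year, and the implied change per decade
slope <- unname(coef(fit)["year"])
decade_change <- 10 * slope
```

A negative `slope` corresponds to a downward-sloping line in the panel. Reporting the change per decade alongside the chart gives the reader a figure to take away as well as a shape.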

Good storytelling with data requires that you first define the conclusion you want the consumer of your data product to draw. The next step is to identify the dependent and independent variables and perhaps some grouping as well. Only after you can clearly define these aspects can you choose the best geometry. There are many chart choosers available on the web that help you with this choice.
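In ggplot2 this order of decisions maps directly onto the grammar: the dependent and independent variables go into `aes()`, the grouping into a facet, and the geometry is added last. A minimal sketch, assuming a made-up data frame `toy` with invented values (not the NSIDC data):

```r
library(ggplot2)

# Hypothetical toy data: one value per month and year, invented for illustration.
toy <- expand.grid(year = 1979:2020,
                   month = factor(month.abb, levels = month.abb))
toy$area <- 18 + 3 * sin(2 * pi * as.integer(toy$month) / 12) -
  0.03 * (toy$year - 1979)

# Steps 1 and 2: variables and grouping are fixed before any geometry is chosen
base <- ggplot(toy, aes(x = year, y = area)) +
  facet_wrap(~month)

# Step 3: try different geometries on the same mapping
p_line  <- base + geom_line()
p_point <- base + geom_point(size = 0.5)
```

Because the mapping lives in `base`, swapping geometries is a one-line change, which makes it cheap to compare candidates before settling on one.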

The code below loads the data from the website and transforms it into a tidy format, calculates the minimum, mean and maximum sea ice area and visualises the results.

  library(dplyr)
  library(forcats)
  library(readr)
  library(ggplot2)

  ## Download the compressed data file; mode = "wb" keeps the binary file intact on Windows
  download.file("https://sites.google.com/site/arctischepinguin/home/sea-ice-extent-area/data/nsidc_global_nt_final_and_nrt.txt.gz", destfile = "global-sea-ice.gz", mode = "wb")
  ## readr decompresses .gz files on the fly, so no external gunzip call is needed
  sea_ice <- read_csv("global-sea-ice.gz", skip = 21)

  ## Original plot
  ggplot(sea_ice, aes(yearday, area, col = format(date, "%Y"))) +
    geom_line() +
    scale_color_discrete(name = "") +
    labs(title = "Global Sea Ice Area",
         subtitle = "from NSIDC NASA Team sea ice concentration data",
         y = expression(Sea~Ice~Area~10^6~km^2))

  ## Ribbon version
  ## Minimum, mean and maximum area for each month of each year
  sea_ice_min_max <- sea_ice %>%
    mutate(month = fct_relevel(as.factor(format(date, "%b")), month.abb),
           year = as.numeric(format(date, "%Y"))) %>%
    group_by(year, month) %>%
    summarise(min_area = min(area, na.rm = TRUE),
              mean_area = mean(area, na.rm = TRUE),
              max_area = max(area, na.rm = TRUE),
              .groups = "drop")

  ggplot(sea_ice_min_max, aes(x = year, ymin = min_area, ymax = max_area)) +
    geom_ribbon(fill = "dodgerblue", alpha = .5) +
    scale_y_continuous(breaks = seq(12, 24, 2)) +
    geom_smooth(aes(y = mean_area), se = FALSE, method = "lm") + 
    facet_wrap(~month) +
    labs(title = "Global Sea Ice Area",
         subtitle = "from NSIDC NASA Team sea ice concentration data\nMinimum and maximum sea ice area and linear regression over the mean",
         y = expression(Sea~Ice~Area~10^6~km^2)) +
    theme_light(base_size = 8) +
    theme(panel.background = element_rect(fill = "#f5f5dc"))
  ggsave("../../static/images/data-science/ice-data-beautiful.png", width = 6, height = 4)
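The summarising step can be tried without downloading anything. The sketch below repeats it on a made-up data frame `toy_daily` (invented values, not the NSIDC data). As an assumption of mine, it builds the month labels from the month number rather than from `format(date, "%b")`, so the factor levels match `month.abb` even in a non-English locale.

```r
library(dplyr)

# Hypothetical daily series covering two years; values are invented.
toy_daily <- data.frame(
  date = seq(as.Date("1979-01-01"), as.Date("1980-12-31"), by = "day")
)
toy_daily$area <- 18 +
  3 * sin(2 * pi * as.numeric(format(toy_daily$date, "%j")) / 365)

# One row per year and month with the minimum, mean and maximum area
toy_summary <- toy_daily %>%
  mutate(month = factor(month.abb[as.integer(format(date, "%m"))],
                        levels = month.abb),
         year = as.numeric(format(date, "%Y"))) %>%
  group_by(year, month) %>%
  summarise(min_area = min(area),
            mean_area = mean(area),
            max_area = max(area),
            .groups = "drop")
```

The result has one row for each year and month, which is the shape `geom_ribbon()` and `facet_wrap(~month)` expect in the plotting code above.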


Source: Peter Prevos, https://lucidmanager.org/data-science/storytelling-with-data/
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 License.

Last modified: Monday, May 17, 2021, 10:08 PM