1. Introduction and Motivation

1.3. Data Classification

Our secondary level of classification is based on the origin of the data used in each paper. The data source list can be seen in Table 1. Each paper is assigned a type of data source.

We break the data sources down into primary and secondary. Primary data can be defined as "Original data collected by the visualisation researchers for a specific research goal". Secondary data can be defined as "Data originally collected for a different purpose and reused for another research question". We also include one sub-classification of web-scraped data that is a hybrid of both primary and secondary.

To illustrate these two categories, we present a data collection pipeline for visualisation in Figure 2. We use the term "a priori" to describe a visualisation hypothesis that is formulated before the data are collected. If the initial data are collected after the visualisation hypothesis is formulated, we refer to this as an "a priori hypothesis"; if the hypothesis is posed once the data have already been collected, we refer to it as an "a posteriori hypothesis".


Figure 2. The differences between primary and secondary data sources in visualisation. Research involving primary data proposes the visualisation hypothesis before the data are collected (a priori hypothesis), whereas research involving secondary data proposes the visualisation hypothesis with the knowledge that the data have already been collected (a posteriori hypothesis). The hybrid pipeline contains two stages of data collection, one before and one after the hypothesis. The initial creation of the data is often in the context of social media. At this point, the data are not collected to support the hypothesis posed by the visualisation research. Once the visualisation hypothesis is formulated, the data can then be scraped and collated into a second structured dataset for the purpose of visual research.


Primary:

In this category, we include:

  • Intentional, Active Digital Collection
  • Intentional, Active Research Study Data

The first primary data source is the digital collection of data for the explicit purpose of the visualisation research featured in this survey. For example, Otsuka et al. collected data through electronic name tags worn by members of staff, which identify and record interactions between the staff. These data are then used to visualise the inter-office relationships.

Study data are collected first-hand, through interviews, questionnaires, and reviews, to support an a priori hypothesis. Among the papers in this survey, study data are used most often in the internal intelligence classification. These data are typically collected as part of the visualisation research. For example, Kandel et al. presented an interview study with data visualisation analysts working in industry to characterise the process of industrial data analysis.


Secondary:

Secondary data sources are not collated by those performing the visualisation research featured in this survey. Here, the researchers pose a hypothesis a posteriori, i.e., after the initial data collection. Researchers use the pre-existing data to explore and perform the analysis. In this category, we include:

  • A Priori Databases
  • Business Processes
  • Business By-product

Pre-existing databases are often used as a case study to demonstrate new visualisation techniques. They are databases that were created for the purpose of a previous analysis, and not for the visualisation research in which they are currently used. Roberts et al. used a pre-existing database, provided by their industry partner, for their treemap-based research on call centre data.

Business process visualisation refers to the graphical representation of the operational procedures implemented within a company. For example, Broeksema et al. presented a visualisation system for business decision management and the processes behind these decisions.

The data by-product of a business is similar to a pre-existing database, except that the data are collected as a by-product of business operations. These data often take the form of financial records or Point of Sale (PoS) transactions. For example, Keim et al. presented a novel approach to bar chart designs using transaction data. The business by-product differs from the pre-existing database in that by-product data are generated and collected continuously, regardless of who uses them.


Hybrid:

We classify web-scraped data as a hybrid of both primary and secondary data. Although the data are originally gathered for a purpose other than visualisation (Secondary), the collation of a new, structured dataset is performed after a hypothesis for visualisation is posed (Primary). This leads to our creation of the hybrid primary/secondary classification. Web scraping is often used in the field of customer feedback (see Figure 1) and often involves social media data. The data are downloaded from various online sources into an archive that can be used for research purposes. For example, Hao et al. presented a visual sentiment analysis of customer feedback streams through the scraping of Twitter data.

Web-scraped data are differentiated from a digital collection by the process through which the data are obtained. Web scraping collates online sources of information into one structured dataset, whereas the real-world digital collection of data relies on hardware components that collect information offline.
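
The three-way classification described in this section can be read as a simple decision rule. The following is a minimal sketch of that rule, not a tool used in the survey: the `DataSource` record, its two boolean fields, and the `classify` function are hypothetical names introduced purely for illustration, and the rule assumes that the two questions (was the raw data created for this research, and was a new dataset collated after the hypothesis was posed) are sufficient to separate the three classes.

```python
from dataclasses import dataclass
from enum import Enum


class SourceClass(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"
    HYBRID = "hybrid"


@dataclass(frozen=True)
class DataSource:
    # Hypothetical record; field names are assumptions made for this sketch.
    name: str
    # True if the raw data were originally generated for the visualisation research.
    created_for_this_research: bool
    # True if the researchers collated a new structured dataset
    # after posing the visualisation hypothesis.
    collated_after_hypothesis: bool


def classify(source: DataSource) -> SourceClass:
    """Apply the primary / secondary / hybrid rule described in the text."""
    if source.created_for_this_research:
        # A priori hypothesis: data collected for this research question.
        return SourceClass.PRIMARY
    if source.collated_after_hypothesis:
        # Data created for another purpose, then scraped and collated
        # once the hypothesis exists (e.g. web-scraped social media data).
        return SourceClass.HYBRID
    # A posteriori hypothesis: existing data reused as found.
    return SourceClass.SECONDARY


# Illustrative inputs modelled on the examples cited in this section.
name_tags = DataSource("electronic name tag interactions", True, True)
call_centre = DataSource("industry partner call centre database", False, False)
tweets = DataSource("scraped Twitter customer feedback", False, True)

for src in (name_tags, call_centre, tweets):
    print(f"{src.name}: {classify(src).value}")
```

Under these assumptions, the name tag data fall under primary, the call centre database under secondary, and the scraped Twitter data under hybrid, matching the assignments given in the text.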