3. The Conceptual Foundations of Data Reusability

3.3. Data Abstraction and Data Representation

Research data provide an account of the results of a scientific work and must therefore be intelligible to those wishing to understand or scrutinize them. Consequently, data communication must be differentiated for audiences with different scientific and cognitive backgrounds. Effective data communication should enable recipients to scrutinize a dataset at the level of abstraction and the level of representation most appropriate to their scientific background and research interests.


3.3.1. Data Abstraction

There are two main varieties of abstraction: ontological and epistemological. The ontological approach to abstraction is concerned with the different levels of organization of a system that can be identified and defined. For example, a database can have conceptual, semantic, syntactic, and physical levels of organization.

The epistemological approach to abstraction is concerned with the different levels of observation or interpretation at which a system can be studied. For example, a database can be observed and analyzed at different levels of abstraction, consisting of data related by time, place, instrument, or object of observation. Examples of epistemological levels of abstraction are spatial and temporal data abstractions.

Basic Concepts: To describe the "method of abstraction", three concepts are introduced:
Typed Variable: A "typed variable" is a uniquely named conceptual entity (the variable) together with a set (called its type) consisting of all the values that the entity may take. Two typed variables are regarded as equal if and only if their variables have the same name and their types are equal as sets. A variable that cannot be assigned well-defined values is said to constitute an ill-typed variable.

The degree to which a type is appropriate depends on its context and use.

Observable: An "observable" is an interpreted typed variable, that is, a typed variable together with a statement of which feature of the scientific data collection under consideration (for example, spatial, temporal, graphical, visual) it represents.

The definition of an observable reflects a "particular view" or "attitude" towards the data collection being studied. Most commonly, it corresponds to a simplification.

Level of Abstraction: A level of abstraction (LoA) is a finite but non-empty set of observables. No order is assigned to the observables. Different LoAs may be appropriate for different purposes. The definition of observables is the first step in studying a data collection at a given LoA. The second step consists in deciding what relationships hold between the observables.
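The three definitions above can be sketched in code. The following is a minimal illustration, not a definitive formalization: the names TypedVariable, Observable, and make_loa, as well as the example variables "season" and "station", are hypothetical and do not come from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, Hashable


@dataclass(frozen=True)
class TypedVariable:
    """A uniquely named variable together with its type (the set of values it may take)."""
    name: str
    type_: FrozenSet[Hashable]


@dataclass(frozen=True)
class Observable:
    """A typed variable plus a statement of the feature of the data collection it represents."""
    variable: TypedVariable
    feature: str  # for example "temporal" or "spatial"


def make_loa(*observables: Observable) -> FrozenSet[Observable]:
    """Build a level of abstraction: a finite, non-empty, unordered set of observables."""
    if not observables:
        raise ValueError("a level of abstraction must contain at least one observable")
    return frozenset(observables)


# Hypothetical example variables (assumptions made for illustration only).
season = TypedVariable("season", frozenset({"winter", "spring", "summer", "autumn"}))
same_season = TypedVariable("season", frozenset({"autumn", "summer", "spring", "winter"}))
station = TypedVariable("station", frozenset({"A", "B"}))

# Same name and types equal as sets, so the two typed variables are equal.
print(season == same_season)  # True

# A LoA is a set, so the order in which observables are listed is irrelevant.
loa = make_loa(Observable(season, "temporal"), Observable(station, "spatial"))
print(loa == make_loa(Observable(station, "spatial"), Observable(season, "temporal")))  # True
```

Using frozen sets mirrors the two properties stated in the definitions: equality of types "as sets" and the absence of any order among the observables of a LoA.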


The Method of Abstraction

As scientific databases should be studied at different levels of abstraction, a method for specifying these different levels of abstraction must be defined.

To specify a level of abstraction, one must first identify the range of queries that the target audience can meaningfully ask and that are answerable in principle. The input of a level of abstraction is the scientific database under analysis; its output is an abstract view of that database. The type and amount of data vary with the level of abstraction: a lower level produces a view that contains more data than a view produced at a higher, more abstract level. The type and quantity of data to be taken into consideration are therefore predetermined by the choice of level. In essence, a given level of abstraction provides a quantified commitment to the kind and amount of data that can be extracted from a scientific database. The observables at a given level of data abstraction can be obtained as the result of a query issued against the database.
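As an illustration of observables obtained by querying a database at different levels of abstraction, the following minimal sketch uses Python's standard sqlite3 module. The table "readings", its columns, the sample values, and the names LOA_QUERIES and view are all hypothetical, chosen only to show that a lower level yields a larger view than a higher one.

```python
import sqlite3

# In-memory database standing in for a scientific database (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, day TEXT, hour INTEGER, temp_c REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?)",
    [
        ("A", "2020-01-01", 0, 1.0),
        ("A", "2020-01-01", 12, 5.0),
        ("A", "2020-01-02", 0, 2.0),
        ("A", "2020-01-02", 12, 6.0),
        ("B", "2020-01-01", 0, 3.0),
        ("B", "2020-01-01", 12, 7.0),
    ],
)

# One query per level of abstraction, ordered from lowest (most data) to highest.
LOA_QUERIES = {
    "hourly": "SELECT station, day, hour, temp_c FROM readings "
              "ORDER BY station, day, hour",
    "daily": "SELECT station, day, AVG(temp_c) FROM readings "
             "GROUP BY station, day ORDER BY station, day",
    "station": "SELECT station, AVG(temp_c) FROM readings "
               "GROUP BY station ORDER BY station",
}


def view(level):
    """Return the abstract view of the database at the given level of abstraction."""
    return conn.execute(LOA_QUERIES[level]).fetchall()


for level in LOA_QUERIES:
    print(level, len(view(level)))
# hourly 6
# daily 3
# station 2
```

Each query fixes in advance which columns and how many rows the view contains, which is one way to read the "quantified commitment" described above.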

The method of abstraction is ideally suited to the study and analysis of large and complex databases derived from experiments and from upcoming petascale and exascale simulation systems. Such databases are best understood stepwise, that is, by their gradual disclosure at increasingly fine levels of abstraction.

Several data abstraction approaches are currently used by data scientists in order to improve data accessibility and understandability; among them, we list the most relevant:

Metadata. An ontological data abstraction level that is of paramount importance in the domain of scientific data is the metadata abstraction level. This abstraction level captures the information content of the underlying data independent of representational details. Metadata descriptions enable representation of domain knowledge describing the information domain to which the underlying data belong.

Data Virtualization. An important ontological data abstraction level is data virtualization. It hides all the technical aspects of data storage: data users do not have to know where the data are physically stored, where the database servers run, what the source Application Programming Interface (API) and database language are, and so on.

Data Clustering. An epistemological data abstraction approach is data clustering. It allows the grouping of the data into clusters; the data contained in a cluster are similar to each other while data belonging to different clusters are dissimilar.
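To make the clustering approach concrete, the following minimal sketch groups one-dimensional measurements by splitting the sorted values wherever the gap between neighbours exceeds a threshold. In one dimension this is equivalent to single-linkage clustering cut at that distance. The function name cluster_1d, the parameter max_gap, and the sample values are hypothetical; the text does not prescribe any particular clustering algorithm.

```python
def cluster_1d(values, max_gap):
    """Group values so that neighbouring members of a cluster differ by at most max_gap."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            # Close enough to the last value of the current cluster: same cluster.
            clusters[-1].append(v)
        else:
            # Gap too large (or first value): start a new cluster.
            clusters.append([v])
    return clusters


# Hypothetical temperature measurements forming two well-separated groups.
temps = [20.1, 3.2, 19.8, 2.9, 3.5, 20.4]
clusters = cluster_1d(temps, max_gap=1.0)
print(clusters)  # [[2.9, 3.2, 3.5], [19.8, 20.1, 20.4]]
```

A consumer working at this level of abstraction sees two groups of similar measurements rather than six individual values.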

In conclusion, we argue that (i) epistemological abstraction should be retained as a proper abstraction method for increasing data reusability as it supports the definition of several levels of explanation and interpretation of a scientific database; and (ii) ontological abstraction should be retained as a proper abstraction method for increasing data accessibility and understandability as it supports the definition of several levels of organization of a scientific database.

Finally, we argue that the right level of abstraction to be communicated to a given data consumer community should be based on the "usability" relation established with the data producer community.


3.3.2. Data Representation

Appropriate data representation is essential for enabling scientists to interpret data correctly and use them appropriately, as the same information content can be represented differently in different data description languages. A major problem is that we have no shared formal conceptual model of data representation that is both accurate and sufficiently detailed to support the data needs of scientists belonging to different scientific disciplines. The traditional relational data model is not adequate to represent the data needs of most scientific disciplines. For some disciplines (astronomy, oceanography, fusion, and remote sensing), an array data model is more appropriate. Other disciplines, e.g., biology and genomics, consider graphs and sequences more appropriate for their needs. Lastly, solid modelling applications require a mesh data model.

In the big data era, pictorial representation of data is of paramount importance. It makes the presentation of data more intelligible, allows investigators to easily see the salient features of the data, and brings out the hidden patterns and trends in complex datasets. Two of the main approaches to the pictorial representation of data are described below:

Visual Representation of Data. Effective data visualization improves interpretation of data and helps scientists in analyzing and reasoning about data and evidence. Visual data analysis enables the detection and validation of expected results while also enabling unexpected discoveries in science. Data visualization makes complex data more accessible, understandable, and usable. In the big data era, data visualization is an indispensable technique for extracting meaning from large and complex scientific datasets.

Graphical Representation of Data. Graphical methods are also well suited for digesting large amounts of data. They give investigators a clearer view of the information collected and of the distribution of the data. The graphical representation of data enhances understanding, makes comparisons easy, and leaves a longer-lasting impression on the mind.