6. Enabling Technologies

6.1. (Meta) Data Modeling

To make data understandable, it is necessary to define and develop formal models that adequately describe:

  • data representation needs of a given scientific discipline;
  • data provenance information;
  • data contextual information;
  • data uncertainty;
  • data quality information.

All this information is collectively called metadata. If scientists are to reuse data collected by others, then the data must be carefully documented. Metadata is the descriptive information about data that explains the measured attributes, their names, units, precision, accuracy, data layout, and ideally a great deal more. Most importantly, metadata should include the data lineage, i.e., how the data was measured, acquired, or computed. The use of purpose-oriented metadata models is of paramount importance to achieve data reusability. Data is incomprehensible, and hence useless, unless there is a detailed and clear description of how and when it was gathered and how any derived data was produced. The type of descriptive information to be provided by the data author depends largely on the usability relationships established between data authors and users.
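The kinds of descriptive information listed above can be collected into a structured record. The following is a minimal sketch of such a metadata record; the field names, class names, and sample values are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class AttributeMetadata:
    # Descriptive information for one measured attribute.
    name: str
    units: str
    precision: float   # smallest meaningful increment
    accuracy: float    # estimated bound on measurement error

@dataclass
class DatasetMetadata:
    # A minimal record covering the categories named above:
    # representation, lineage, context, and quality information.
    title: str
    attributes: list = field(default_factory=list)
    lineage: str = ""                 # how the data was measured, acquired, or computed
    context: dict = field(default_factory=dict)
    quality_notes: str = ""

# Hypothetical example record.
record = DatasetMetadata(
    title="Sea-surface temperature, 2020",
    attributes=[AttributeMetadata("sst", "degC", 0.01, 0.1)],
    lineage="Daily averages computed from satellite swath data",
)
print(record.attributes[0].units)  # degC
```

In practice such records would be serialized in a community-agreed format so that they can be exchanged and validated across systems.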

Data Provenance Modeling: In its most general form, provenance (also sometimes called lineage) captures where data came from and how it has been updated over time. Provenance can serve a number of important functions: explanation, verification, re-computation, and repeatability. In the long term, a standard, open representation and query model is needed. A promising example is the "Open Provenance Model", a community-driven model that allows provenance to be exchanged between systems.
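The idea of an exchangeable provenance model can be illustrated with a small graph of artifacts and processes connected by "used" and "wasGeneratedBy" edges, in the spirit of the Open Provenance Model. This is a simplified sketch, not the actual OPM serialization; the node names are invented.

```python
# Minimal provenance graph: nodes are artifacts (data items) and
# processes; edges record causal dependencies between them.
class ProvenanceGraph:
    def __init__(self):
        self.edges = []  # (subject, relation, object) triples

    def was_generated_by(self, artifact, process):
        self.edges.append((artifact, "wasGeneratedBy", process))

    def used(self, process, artifact):
        self.edges.append((process, "used", artifact))

    def lineage(self, node):
        # Walk backwards from a node to everything it depends on,
        # supporting explanation, verification, and re-computation.
        result, frontier = set(), [node]
        while frontier:
            current = frontier.pop()
            for s, _, o in self.edges:
                if s == current and o not in result:
                    result.add(o)
                    frontier.append(o)
        return result

g = ProvenanceGraph()
g.was_generated_by("monthly_mean.nc", "averaging_job")
g.used("averaging_job", "raw_swaths.nc")
print(sorted(g.lineage("monthly_mean.nc")))  # ['averaging_job', 'raw_swaths.nc']
```

A query such as `lineage` shows why a shared edge vocabulary matters: two systems can answer the same question over exchanged provenance records.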

Data Context Modeling: Context is a poorly used source of information in our computing environments. As a result, we have an impoverished understanding of what context is and how it can be used.

Contextual information is any information that can be used to characterize the situation of a digital information object. In essence, this information documents the relationship of the data to its environment. Context is the set of all contextual information that can be used to characterize the situation of a digital information object.

Several context modeling approaches exist; they are commonly classified by the data structures used to exchange contextual information in the respective system: key-value models, mark-up scheme models, object-oriented models, logic-based models, and ontology-based models.

Data Uncertainty Modeling: As models of the real world, scientific datasets are often permeated with forms of uncertainty. Uncertainty is the quantitative estimation of error; all measurements contain some uncertainty generated through systematic error and/or random error. Acknowledging the uncertainty of data is an important component of reporting the results of scientific investigation.

There has been a significant amount of work in areas variously known as "uncertain, probabilistic, fuzzy, approximate, incomplete and imprecise" data management.

Unfortunately, current data management products do not support uncertainty. The development of a suitable database theory for dealing with uncertain information remains an open challenge.
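At the level of a single measurement, the distinction drawn above between systematic and random error can be made concrete. The sketch below attaches a quantitative uncertainty to a value by combining independent systematic and random components in quadrature, a standard propagation rule; the numbers themselves are illustrative.

```python
import math

class Measurement:
    # A value annotated with its combined standard uncertainty,
    # derived from independent systematic and random error estimates.
    def __init__(self, value, systematic, random_error):
        self.value = value
        # Combine independent error components in quadrature.
        self.uncertainty = math.sqrt(systematic**2 + random_error**2)

    def __repr__(self):
        return f"{self.value} ± {self.uncertainty:.3f}"

m = Measurement(20.1, systematic=0.05, random_error=0.12)
print(m)  # 20.1 ± 0.130
```

Recording uncertainty alongside each value in this way is precisely what current database products lack native support for, which is why it is usually bolted on at the application level.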

Data Quality Modeling: The quality of data is a complex concept, difficult to define. There is no common or agreed upon definition or measure for data quality, apart from such a general notion as fitness for use.

The consequences of poor data quality are often experienced in all scientific disciplines, but without making the necessary connections to its causes. Awareness of the importance of improving the quality of data is increasing in all scientific fields.

In order to fully understand the concept, researchers have traditionally identified a number of specific quality dimensions. A dimension or characteristic captures a specific facet of quality. The more commonly referenced dimensions include accuracy, completeness, consistency, currency, timeliness, and volatility.
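Two of the dimensions named above, completeness and currency, lend themselves to simple operational measures. The following sketch computes them for a hypothetical record; the field names, dates, and any acceptance thresholds are assumptions for illustration.

```python
from datetime import date

def completeness(record):
    # Fraction of attributes that actually carry a value.
    values = list(record.values())
    return sum(v is not None for v in values) / len(values)

def currency_days(record_date, reference_date):
    # Age of the record, in days, relative to a reference date;
    # a discipline-specific threshold would decide if it is "current".
    return (reference_date - record_date).days

record = {"sst": 20.1, "salinity": None, "depth": 5.0}
print(completeness(record))                                # 2/3 of fields present
print(currency_days(date(2021, 1, 1), date(2021, 3, 2)))   # 60
```

Other dimensions, such as accuracy or consistency, require ground truth or integrity constraints and cannot be computed from the record alone, which is one reason a single agreed measure of data quality has proved elusive.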

For specific categories of data and for specific scientific disciplines, it may be appropriate to have specific sets of dimensions.

Metadata is as valuable as the data itself. The use of metadata, and its accuracy, have increased over the past several decades. The quality of metadata is probably the single most important factor enabling the reusability of scientific data.

Data Paper. Recently, a mechanism able to improve data understandability, and thus data reusability, has been proposed: the data paper. A data paper can be defined as a scholarly publication of a searchable metadata document describing a particular online-accessible dataset, or a group of datasets, published in accordance with standard academic practices. In essence, a data paper is a journal publication whose primary purpose is to describe data, rather than to report a research investigation. As such, it contains facts about data rather than the hypotheses, and the arguments in support of those hypotheses, found in a conventional research article. Its purpose is threefold: (i) to provide a citable journal publication that brings scholarly credit to data authors; (ii) to describe the data in a structured, human-readable form; and (iii) to bring the existence of the data to the attention of the scholarly community.

A data paper should describe how the data sets were collected/created, who collected/created them and who owns these data sets, which software was used to create the data sets, the spatial and temporal coverage of the data sets, etc. It could also include sections summarizing the history of the data set, e.g., original purpose, funding body, etc., as well as its perceived value and usefulness to scientific research (fundamental and/or applied).

An important feature of data papers is that they should always be linked to the published datasets they describe, and that this link (a URL, ideally resolving to a digital object identifier, DOI) should be published within the paper itself.
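The descriptive content a data paper should carry, as outlined above, amounts to a small structured document. The sketch below gathers those elements, including the dataset link, into one record; every field name and value, including the DOI, is an illustrative placeholder rather than a real identifier.

```python
# Hypothetical structured content of a data paper: collection method,
# ownership, software, coverage, and the link to the dataset itself.
data_paper = {
    "title": "A decade of coastal temperature profiles",
    "authors": ["A. Author", "B. Author"],
    "dataset_link": "https://doi.org/10.xxxx/placeholder",  # placeholder DOI
    "collection_method": "CTD casts, monthly, 2010-2019",
    "owner": "Marine Observatory (illustrative)",
    "software": "vendor processing suite (illustrative)",
    "spatial_coverage": {"lat": (43.0, 44.0), "lon": (7.0, 8.0)},
    "temporal_coverage": ("2010-01", "2019-12"),
}

# The defining requirement: the dataset link is part of the paper itself
# and ideally resolves through a DOI.
print(data_paper["dataset_link"].startswith("https://doi.org/"))  # True
```

Publishing the link inside the record in this way is what keeps the paper and the dataset permanently connected, even as the hosting repository changes.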