4. Barriers That Hamper Data Reuse

Despite its importance, data reuse is not easy. We have identified five main obstacles:


Heterogeneity of Representations

Heterogeneity of representations gives rise to four critical impediments to data reuse.

Heterogeneous Data Representations: There is a wide variety of scientific data models and formats, and scientific information expressed in one formalism cannot be directly incorporated into another.

Heterogeneity of Query Languages: Data collections are managed by a variety of systems that support different query languages. It is difficult to share data when each collection must be queried in a different dialect.

Lack of Communication Conventions: Data reuse does not necessarily require a shared database. If separate systems can communicate with one another, each can benefit from the others' data without a common database. Unfortunately, this approach is not generally feasible for today's scientific database systems and file repositories: because these systems are not based on formal data models, making them interoperable is very difficult. We lack an agreed-upon protocol specifying how these systems are to query each other and in what form answers are to be delivered. Similarly, we lack standard protocols that would provide interoperability between research data infrastructures.

Vocabulary Mismatching: Another barrier to data reuse arises when a common vocabulary and domain terminology are lacking.
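As an illustration of vocabulary mismatching, the following minimal sketch shows two sources that describe the same quantities under different field names, and a translation into a shared vocabulary. The field names, the mapping tables, and the function to_shared are hypothetical and chosen only for illustration; they are not taken from any existing standard.

```python
# Hypothetical term mappings from two local vocabularies to one shared vocabulary.
# All field names are illustrative assumptions, not taken from a real standard.
LAB_A_TO_SHARED = {"temp_c": "air_temperature", "rh": "relative_humidity"}
LAB_B_TO_SHARED = {"T_air": "air_temperature", "humidity_pct": "relative_humidity"}


def to_shared(record, mapping):
    """Rename the keys of a record to the shared vocabulary.

    Keys without a mapping are returned separately so that the mismatch
    is reported rather than silently dropped.
    """
    translated = {}
    unmapped = []
    for key, value in record.items():
        if key in mapping:
            translated[mapping[key]] = value
        else:
            unmapped.append(key)
    return translated, unmapped


record_a = {"temp_c": 21.5, "rh": 40}
record_b = {"T_air": 19.0, "humidity_pct": 55, "wind": 3.2}

shared_a, missing_a = to_shared(record_a, LAB_A_TO_SHARED)
shared_b, missing_b = to_shared(record_b, LAB_B_TO_SHARED)
```

Once translated, both records use the same keys and can be compared directly. The term "wind" has no counterpart in the shared vocabulary and is reported as unmapped, which is the situation that arises when a common terminology is lacking.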


Discovering Data

Researchers must be aware of who has the data they need or where the data are located. In a networked, multidisciplinary scientific environment, pinpointing the location of relevant data is a major challenge for researchers. A data discovery capability requires the support of appropriate metadata descriptions and registries, data classification/categorization schemes, and definitions of researcher profiles and goals.

In addition, after finding appropriate data, researchers must often negotiate with the owner or develop trusting relationships to gain access.
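The role of metadata descriptions, registries, and categorization schemes in discovery can be sketched as follows. This is a minimal illustration under stated assumptions: the DatasetEntry fields, the sample registry contents, and the discover function are all hypothetical, and a real registry would rely on much richer metadata and on researcher profiles.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """A hypothetical registry entry: where the data live and how they are described."""
    identifier: str
    location: str
    owner: str
    keywords: set = field(default_factory=set)
    category: str = ""


def discover(registry, wanted_keywords, category=None):
    """Return entries sharing at least one keyword with the query, best match first."""
    wanted = {k.lower() for k in wanted_keywords}
    hits = []
    for entry in registry:
        # Optional filter on the classification/categorization scheme.
        if category is not None and entry.category != category:
            continue
        overlap = len(wanted & {k.lower() for k in entry.keywords})
        if overlap:
            hits.append((overlap, entry))
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in hits]


registry = [
    DatasetEntry("ds-001", "repo-a/ocean", "Group A", {"salinity", "temperature"}, "oceanography"),
    DatasetEntry("ds-002", "repo-b/air", "Group B", {"temperature", "humidity"}, "meteorology"),
    DatasetEntry("ds-003", "repo-c/soil", "Group C", {"moisture"}, "soil science"),
]

results = discover(registry, ["Temperature", "salinity"])
```

Each hit carries the location and the owner of the data set, which are the two things a researcher needs in order to request access.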


Understanding Data

Once a user is in possession of a data set, the next problem is whether the user can understand the information and knowledge embodied in it. Data understandability rests on a fundamental premise: a data set is intelligible only when its metadata relate to its intended use. An additional difficulty arises when the same data set is provided to different user communities; in this case, appropriate abstractions of the data set must be created for each community.

To be understandable, data must be accompanied by auxiliary information, including metadata, community-specific ontologies or taxonomies, and terminologies.

However, much of the knowledge needed to make sense of a data set is tacit. Scientists are not necessarily able to explicate all of the information that is required to understand someone else's work.


Moving Data

In the scientific data universe, actors and data collections inhabit multiple contexts. When data move across contexts, the interpretative context may be lost, and the same data representations may then be interpreted in different ways. This can lead to a phenomenon called "ontological drift", in which the intended meaning becomes distorted as the data move across semantic boundaries (semantic distortion). The risk arises when a shared vocabulary and domain terminology are lacking.
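A small sketch can make the loss of interpretative context concrete. It assumes, purely for illustration, that the only context a value needs is its unit; the Measurement class, the unit labels, and the to_celsius function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """A value that carries its interpretative context (here, only a unit)."""
    value: float
    unit: str  # "celsius" or "fahrenheit"; hypothetical labels


def to_celsius(m):
    """Interpret a measurement using the context it travels with."""
    if m.unit == "celsius":
        return m.value
    if m.unit == "fahrenheit":
        return (m.value - 32) * 5 / 9
    raise ValueError(f"unknown unit: {m.unit}")


# The producer records 50 degrees Fahrenheit.
with_context = Measurement(50.0, "fahrenheit")
correct = to_celsius(with_context)

# If only the bare number crosses the boundary, a consumer who assumes
# Celsius reads a different temperature from the same representation.
bare_value = with_context.value
misread = to_celsius(Measurement(bare_value, "celsius"))
```

The representation (the number 50.0) is identical in both cases; only the assumed context differs, and with it the meaning attributed to the value.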


Data Mismatching

Several data mismatching problems hamper data reusability:

Quality mismatching occurs when the quality profile associated with a data set does not meet the quality expectations of the user of this data set.

Data-incomplete mismatching occurs when a data set lacks information (for example, provenance, contextual, or uncertainty information) that a data user needs in order to exploit it fully.

Data abstraction mismatching occurs when the level of data abstraction (spatial, temporal, graphical, etc.) created by a data author does not match the level of abstraction expected by the data user.
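The three kinds of mismatching listed above can be expressed as simple checks of a data set's metadata against a user's stated needs. The sketch below is only illustrative: the dictionary keys, the quality scores in [0, 1], the resolution labels, and the find_mismatches function are assumptions made for the example and do not describe an established quality model.

```python
def find_mismatches(dataset_meta, user_needs):
    """Compare a data set's metadata with a user's stated needs.

    Both arguments are plain dicts with hypothetical keys:
      quality     -> dict of quality indicator -> score in [0, 1]
      information -> set of auxiliary information shipped with the data
      resolution  -> dict of dimension -> resolution label
    """
    problems = []

    # Quality mismatching: an indicator is missing or below the expected score.
    for indicator, minimum in user_needs.get("quality", {}).items():
        actual = dataset_meta.get("quality", {}).get(indicator)
        if actual is None or actual < minimum:
            problems.append(("quality", indicator))

    # Data-incomplete mismatching: required auxiliary information is absent.
    required = user_needs.get("information", set())
    provided = dataset_meta.get("information", set())
    for item in sorted(required - provided):
        problems.append(("incomplete", item))

    # Data abstraction mismatching: the offered level differs from the expected one.
    for dimension, expected in user_needs.get("resolution", {}).items():
        if dataset_meta.get("resolution", {}).get(dimension) != expected:
            problems.append(("abstraction", dimension))

    return problems


dataset_meta = {
    "quality": {"completeness": 0.9, "accuracy": 0.7},
    "information": {"provenance"},
    "resolution": {"temporal": "daily", "spatial": "10km"},
}
user_needs = {
    "quality": {"accuracy": 0.8},
    "information": {"provenance", "uncertainty"},
    "resolution": {"temporal": "hourly", "spatial": "10km"},
}

mismatches = find_mismatches(dataset_meta, user_needs)
```

In this example the data set falls short on accuracy, lacks uncertainty information, and is offered at a daily rather than hourly temporal resolution, so one mismatch of each kind is reported.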