6. Enabling Technologies

6.3. Data Discovery

By Data Discovery, we mean the capability to quickly and accurately identify and find data that support research requirements. Discovering data that exist within a data collection/database is supported by search and query functionality. This functionality exploits two kinds of resources: data registration and citation capabilities, and the metadata descriptions contained in data categorization/classification schemes, data dictionaries, data inventories, and metadata registries.

Data Registration. By Data Registration capability, we mean a capability enabling researchers to make data citable as a unique piece of work. Once accepted for deposit, data should be assigned a "Digital Object Identifier" (DOI) for registration. A DOI is a unique name (not a location) within the scientific data universe and provides a system for persistent and actionable identification of data. Identifiers should be assigned at the level of granularity appropriate for an envisaged functional use. The Data Registration capability should include a specified numbering syntax, a resolution service, a model, and an implementation mechanism determined by policies and procedures for the governance and application of DOIs.
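The numbering syntax and the resolution service mentioned above can be illustrated with a minimal sketch. It assumes the common DOI name layout (a prefix starting with "10.", a slash, and a suffix chosen by the registrant) and the public resolver at https://doi.org/. The regular expression is a simplification, and the names Doi, parse_doi, and resolver_url are hypothetical helpers introduced only for illustration.

```python
import re
from typing import NamedTuple
from urllib.parse import quote


class Doi(NamedTuple):
    prefix: str  # registrant part, e.g. "10.1234"
    suffix: str  # item part chosen by the registrant


# Simplified pattern (an assumption, not the full DOI syntax rules):
# "10." + numeric registrant code, a slash, then a non-empty suffix.
_DOI_PATTERN = re.compile(r"^(10\.\d{4,9}(?:\.\d+)*)/(\S+)$")


def parse_doi(text: str) -> Doi:
    """Split a DOI name into prefix and suffix, or raise ValueError."""
    match = _DOI_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a DOI name: {text!r}")
    return Doi(match.group(1), match.group(2))


def resolver_url(doi: Doi, resolver: str = "https://doi.org/") -> str:
    """Build the URL that asks a resolution service where the data live now.

    The DOI itself is only a name; the resolver maps it to a location.
    """
    return resolver + doi.prefix + "/" + quote(doi.suffix, safe="/")


# Hypothetical identifier, used only to show the call sequence.
example = parse_doi("10.1234/dataset.v1")
print(resolver_url(example))
```

Keeping the name (the DOI) separate from the location (whatever the resolver returns) is what lets a data set move between repositories without breaking existing references.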

Data Citation. Data can also be identified and accessed through a publication by means of a data citation capability. By data citation capability, we mean a capability providing a reference to data in the same way as researchers routinely provide a bibliographic reference to printed resources. A Data Citation capability should include a minimum of five components: the author of the data set, the date the data set was published, the data set title, a Global Unique Identifier (for example, a Life Science Identifier (LSID), a Digital Object Identifier (DOI), or a Uniform Resource Name (URN)), and a Universal Numeric Fingerprint (UNF). The UNF is a short, fixed-length string of numbers and characters that summarizes the content of the data set, such that a change in any part of the data would produce a completely different UNF. The fifth component is necessary because a Global Unique Identifier alone does not guarantee that the data remain unchanged in any meaningful way when the data storage format changes. Together, the Global Unique Identifier and the UNF ensure permanence, verifiability, and accessibility, even in situations where the data are confidential, restricted, or proprietary.
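The five citation components, and the role of the fingerprint, can be sketched as follows. This is not the actual UNF algorithm: the fingerprint function below is a simplified stand-in that normalizes values, hashes them with SHA-256, and keeps a truncated, base64-encoded digest. The class DataCitation, its field names, and the sample identifier are assumptions made for illustration.

```python
import base64
import hashlib
from dataclasses import dataclass


def fingerprint(rows, digits=7):
    """Simplified content fingerprint (NOT the real UNF algorithm).

    Numbers are normalized to a fixed number of significant digits so that
    the result depends on the content, not on how a storage format writes it.
    """
    canonical_rows = []
    for row in rows:
        cells = []
        for value in row:
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                cells.append(f"{float(value):.{digits - 1}e}")
            else:
                cells.append(str(value))
        canonical_rows.append("\t".join(cells))
    digest = hashlib.sha256("\n".join(canonical_rows).encode("utf-8")).digest()
    # Truncate to 16 bytes so the printable string is short and fixed-length.
    return base64.b64encode(digest[:16]).decode("ascii")


@dataclass(frozen=True)
class DataCitation:
    author: str
    published: str    # date the data set was published
    title: str
    identifier: str   # global unique identifier (DOI, LSID, URN, ...)
    fingerprint: str  # content fingerprint of the cited version

    def render(self) -> str:
        return (f"{self.author} ({self.published}). {self.title}. "
                f"{self.identifier}. FP:{self.fingerprint}")


# Hypothetical data set and identifier, for illustration only.
data = [[1, "control"], [2.5, "treated"]]
citation = DataCitation("Doe, J.", "2010", "Example Survey",
                        "doi:10.1234/dataset.v1", fingerprint(data))
print(citation.render())
```

A reader holding the data can recompute the fingerprint and compare it with the one in the citation. A match indicates the same content, even if the file has been converted to another storage format in the meantime.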

Data Classification. Data Classification is the categorization of data for its most effective and efficient use. Data can be classified according to any criteria. A well-planned data classification system makes essential data easy to find, which is of particular importance in data discovery. A classification scheme should help scientists to effectively answer the following questions:

  • What data types are available?
  • Where are certain data located?
  • What access levels are implemented?
  • What protection level is implemented and does it adhere to compliance regulations?

Although data classification is typically a manual process, tools from many vendors can help gather information about the data and categorize it for several purposes.
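As a minimal sketch of how a classification scheme could answer the four questions listed above, the following models each classified data set as a record and exposes one query per question. The class names, field names, and sample values are hypothetical and do not refer to any particular vendor tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifiedData:
    name: str
    data_type: str         # e.g. "tabular", "image"
    location: str          # where the data are held
    access_level: str      # e.g. "public", "restricted"
    protection_level: str  # e.g. "none", "encrypted"
    compliant: bool        # does the protection meet the applicable regulations?


class Classification:
    def __init__(self, records):
        self._records = list(records)

    def data_types(self):
        """What data types are available?"""
        return sorted({r.data_type for r in self._records})

    def locations_of(self, data_type):
        """Where are data of a certain type located?"""
        return sorted({r.location for r in self._records
                       if r.data_type == data_type})

    def access_levels(self):
        """What access levels are implemented?"""
        return {r.name: r.access_level for r in self._records}

    def non_compliant(self):
        """Which data sets have protection that does not meet regulations?"""
        return [r.name for r in self._records if not r.compliant]


# Hypothetical entries, for illustration only.
catalog = Classification([
    ClassifiedData("survey_2010", "tabular", "archive-a", "public", "none", True),
    ClassifiedData("patient_scans", "image", "archive-b", "restricted", "encrypted", True),
    ClassifiedData("field_notes", "tabular", "archive-b", "internal", "none", False),
])
print(catalog.data_types())
print(catalog.non_compliant())
```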

Data Dictionary. A Data Dictionary holds information about the data contained in a large data collection. Each data element is defined by its data type, the location where it can be found, and the location where it came from. Often, the data dictionary also includes the derivation logic for fields that are derived from other fields. Typically, each data collection has its own data dictionary.
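One possible shape for a data dictionary entry, following the attributes named above (data type, location, origin, and optional derivation logic), is sketched below. The names DataElement and DataDictionary and the sample elements are assumptions for illustration, not an established schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataElement:
    name: str
    data_type: str
    location: str                      # where the element can be found
    source: str                        # where the element came from
    derivation: Optional[str] = None   # logic, only for derived fields


class DataDictionary:
    """One dictionary per data collection."""

    def __init__(self, collection: str):
        self.collection = collection
        self._elements = {}

    def define(self, element: DataElement) -> None:
        if element.name in self._elements:
            raise KeyError(f"{element.name!r} is already defined")
        self._elements[element.name] = element

    def lookup(self, name: str) -> DataElement:
        return self._elements[name]

    def derived_fields(self):
        """Names of the elements that carry derivation logic."""
        return sorted(name for name, element in self._elements.items()
                      if element.derivation is not None)


# Hypothetical collection and elements, for illustration only.
dictionary = DataDictionary("ocean_temperatures")
dictionary.define(DataElement("temp_c", "float", "readings.temp_c", "buoy sensor"))
dictionary.define(DataElement("temp_f", "float", "readings.temp_f", "temp_c",
                              derivation="temp_c * 9 / 5 + 32"))
print(dictionary.derived_fields())
```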

Metadata Registry. By domain-specific Metadata Registry, we mean a registry used to describe, document, protect, control, and access informational representations of a scientific domain. There are two types of metadata registry: (i) metadata schema registries, which are databases containing the metadata schemas that apply to the data collections/databases of a scientific domain; and (ii) metadata registries, which hold metadata and reference information, a kind of index of terms regarding the data stored in the data collections/databases of a scientific domain. These two types of registry can be components of a two-tiered metadata registry architecture.

A Metadata Registry supports data reuse as it:

  • holds precise data definitions and descriptions;
  • holds documentation of data characteristics;
  • provides guidance for the identification of data elements stored in data collections/databases;
  • provides means for organizing standard shareable data elements; and
  • sets up common standards between communities of practice.
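The two-tiered architecture described above can be sketched, under simplifying assumptions, as two cooperating classes: a schema registry that stores which metadata fields each schema requires, and a metadata registry that accepts only records conforming to a registered schema and keeps an index of terms for discovery. All class, method, and schema names here are hypothetical.

```python
class SchemaRegistry:
    """Tier 1: holds the metadata schemas of a scientific domain."""

    def __init__(self):
        self._schemas = {}

    def register(self, name, required_fields):
        self._schemas[name] = frozenset(required_fields)

    def required_fields(self, name):
        return self._schemas[name]


class MetadataRegistry:
    """Tier 2: holds metadata records and an index of terms."""

    def __init__(self, schemas):
        self._schemas = schemas
        self._records = {}
        self._index = {}

    def add(self, collection, schema_name, metadata, terms):
        # Reject records that lack fields the schema requires.
        missing = self._schemas.required_fields(schema_name) - set(metadata)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        self._records[collection] = dict(metadata)
        for term in terms:
            self._index.setdefault(term.lower(), set()).add(collection)

    def describe(self, collection):
        return self._records[collection]

    def find(self, term):
        """Return the collections indexed under a term."""
        return sorted(self._index.get(term.lower(), set()))


# Hypothetical schema and collection, for illustration only.
schemas = SchemaRegistry()
schemas.register("observation", ["title", "creator", "unit"])

registry = MetadataRegistry(schemas)
registry.add("ocean_temperatures", "observation",
             {"title": "Ocean temperatures", "creator": "Doe, J.", "unit": "Celsius"},
             terms=["temperature", "ocean"])
print(registry.find("Temperature"))
```

Validating every record against a shared schema is one way to obtain the precise definitions and common standards listed above, while the term index supports the identification of data elements across collections.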