Read this article. Be sure you can explain the methods (approach) for extracting data based on usability.
2. The Research Data Universe
The research data universe is complex: many actors use many types of data for many different scientific purposes. Recent years have seen the rise of numerous data collections that are robust and flexible, accommodating heterogeneous data types and associated metadata in order to satisfy the wide-ranging requirements of diverse research communities.
Research Data
By research data, we mean scientific or technical measurements, values calculated, and observations or facts that can be represented by numbers, tables, graphs, models, text, or symbols and that are used as a basis for reasoning or further calculation. Such data may be generated by various means, including observation, computation, or experimentation. Scientists regard data as accurate representations of the physical world and as evidence to support claims.
Data can be distinguished by their origins: observational, computational, or experimental.
- Observational data are collected by direct observation; their distinguishing feature is that they cannot be recollected.
- Computational data are produced by executing a computer model or simulation; their distinguishing feature is that they can be reproduced.
- Experimental data are collected by conducting experiments; in principle, data from experiments can be accurately reproduced. In practice, however, it may not be possible to reproduce precisely all of the experimental conditions.
Data can be referred to as raw, derivative, or verified.
- Raw data consist of original observations, such as those collected by satellite and beamed back to Earth, generated by an instrument or sensor, or collected by conducting an experiment.
- Derivative data are generated by processing activities. The raw data are frequently subject to subsequent stages of refinement and analysis, depending on the research objectives, so there may be a succession of versions. While the raw data may be the most complete form, derivative data may be more readily usable by others, because processing usually orders or simplifies the data and thus increases their intelligibility.
- Verified data are generated by curatorial activities. Their quality and accuracy have, thus, been assured.
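The two classifications above (by origin and by processing stage) can be sketched as a small data model. This is only an illustrative sketch: the names `Origin`, `Stage`, `DatasetRecord`, `reproducible_in_principle`, and `derive_mean` are hypothetical and do not come from the article, and the sample values are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    OBSERVATIONAL = "observational"
    COMPUTATIONAL = "computational"
    EXPERIMENTAL = "experimental"


class Stage(Enum):
    RAW = "raw"
    DERIVATIVE = "derivative"
    VERIFIED = "verified"


@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical record type: a name, the two classifications, and the values.
    name: str
    origin: Origin
    stage: Stage
    values: tuple


def reproducible_in_principle(record):
    # Observational data cannot be recollected; computational and
    # experimental data can, at least in principle, be reproduced.
    return record.origin is not Origin.OBSERVATIONAL


def derive_mean(record, new_name):
    # Processing a dataset yields a derivative version; the origin is unchanged.
    mean = sum(record.values) / len(record.values)
    return DatasetRecord(new_name, record.origin, Stage.DERIVATIVE, (mean,))


# Made-up sample values, used only to exercise the sketch.
raw = DatasetRecord("sensor_readings", Origin.OBSERVATIONAL, Stage.RAW, (2.0, 4.0, 6.0))
summary = derive_mean(raw, "sensor_mean")
print(summary.stage.value, summary.values, reproducible_in_principle(summary))
```

Keeping origin and stage as separate fields reflects that the two classifications are independent: a derivative dataset keeps the origin of the raw data it was computed from.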
Data Collections/Databases
Scientific data are stored in managed data collections/databases. Data collections fall into one of three functional categories:
- Research Data Collections are the products of one or more focused research projects and typically contain data that are subject to limited processing or curation. They may or may not conform to community standards, such as standards for file formats, metadata structure, and content access policies.
- Resource or Community Data Collections serve a single science or engineering community. These digital collections often establish community-level standards either by selecting from among preexisting standards or by bringing the community together to develop new standards where they are absent or inadequate.
- Reference Data Collections are intended to serve large segments of the scientific and educational community. Characteristic features of this category of digital collections are the broad scope and diverse set of user communities including scientists, students, and educators from a wide variety of disciplinary, institutional, and geographical settings. In these circumstances, conformance to robust, well-established, and comprehensive standards is essential, and the selection of standards by reference collections often has the effect of creating a universal standard.
Data Actors
The main actors in the scientific data universe are:
- Data Authors are individuals or teams involved in research activities that generate digital data that are subsequently deposited in a data collection. Their interests lie in ensuring that they enjoy the benefits of their own work, including gaining appropriate credit and recognition, and that their results can be broadly disseminated and safely archived.
- Data Users are representatives of the scientific communities. Their interests lie in having ready access to data sets that are discoverable and intelligible, i.e., well defined and well documented.
- Data Managers are individuals responsible for the operation and maintenance of the data collections/databases.
- Data Scientists are information and computer scientists developing innovative concepts in database technology and information sciences, including scientific data modeling, data discovery, data visualization, etc., and applying these to the fields of science relevant to the data collection/database.
Data Uses
Data are used in different ways according to their contexts. Two broad categories of data use can be defined:
- End Use is defined as the ability to access a dataset in order to verify some fact or perform some job-related or personal task.
- Derivative Use is defined as the ability to build on a preexisting dataset by extracting information from one or more datasets in order to create a new dataset that can be used for the same, a similar, or an entirely different purpose than the original dataset(s).
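The difference between the two categories of use can be illustrated with a minimal sketch. The datasets, station identifiers, and function names below (`end_use_lookup`, `derivative_use_join`) are hypothetical and serve only as an example; the join shown is one possible way of building a new dataset, not a method prescribed by the article.

```python
# Two hypothetical preexisting datasets, keyed by station identifier.
temperature_c = {"A": 12.5, "B": 15.0, "C": 9.0}
elevation_m = {"A": 120, "B": 40, "D": 300}


def end_use_lookup(dataset, key):
    # End use: read a value to check a fact; nothing new is created.
    return dataset.get(key)


def derivative_use_join(left, right):
    # Derivative use: extract information from two datasets and build a
    # new one, here restricted to the keys present in both sources.
    return {k: (left[k], right[k]) for k in sorted(left.keys() & right.keys())}


combined = derivative_use_join(temperature_c, elevation_m)
print(end_use_lookup(temperature_c, "B"))
print(combined)
```

In the end-use case the original dataset is simply consulted, whereas the derivative-use case produces a new dataset (`combined`) that can serve a purpose neither source served alone.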
Diversity in the Research Data Universe
In conclusion, we can affirm that research data exist in many different types and formats subject to varying legal, cultural, protective, and practical constraints. Data authors, managers, and users often come from different disciplinary, professional, cultural, and other settings with different needs, expectations, responsibilities, authorities, and expertise. These experts are subject to varying legal, physical, scientific, cultural, and other constraints.
This diversity in data, individuals, disciplines, contexts, and cultures is the major challenge researchers face in harnessing the accumulating data and knowledge produced by the research communities and making them reusable.