The main real-world datasets used in the studies analyzed in this review were sensor data, image metadata, website publications, and electronic documents. Most of the studies analyzed did not document the specific languages or tools they used to model their data. However, driven by the need to analyze large volumes of data with varied structures arriving at high frequency, database research has shifted its focus from relational databases toward NoSQL systems.
1. Introduction
1.1. Big Data Concepts
In this section, we describe the main concepts related to Big Data, in order to provide the reader with a general overview and background on the terms discussed later.
1.1.1. A Brief History of Big Data
The production and processing of large volumes of data began to interest researchers many years ago. As early as 1944, American universities were producing estimates of the size of their libraries, which grew rapidly every year. In 1997, at the Institute of Electrical and Electronics Engineers (IEEE) Conference on Visualization, the term "Big Data" was used for the first time, during the presentation of a study on the visualization of large datasets.
Big Data has been the buzzword of recent years, that is, a fashionable expression in information systems. The general public relates the term Big Data to its literal meaning of large volumes of data. However, Big Data is a generic term used to refer to large and complex datasets characterized by the combination of the famous Big Data Vs.
1.1.2. Big Data Characterization
As mentioned before, Big Data does not refer only to high volumes of data to be processed. In early Big Data studies, volume, velocity and variety were considered the fundamental characteristics, known as the three Vs of Big Data. As research advanced, new Vs, such as value and veracity, were established. Currently, some authors propose up to 42 characteristics needed to consider data as Big Data, thereby defining 42 Vs. For the purposes of our study, we mention only the ten Vs of Big Data presented in the literature. Table 1 summarizes each characteristic, along with a brief description.
Table 1. The ten Vs of Big Data.

| Characteristic | Brief Description |
|---|---|
| Volume | Large datasets |
| Velocity | High data generation rate |
| Variety | Different types and formats of data |
| Variability | Inconsistency in data flows |
| Viscosity | Variations in data velocity |
| Virality | Rate at which data spread |
| Veracity | Accuracy of data |
| Validity | Correctness of data for the intended use |
| Visualization | Graphical representation of data |
| Value | Usefulness of data for retrieving information |
1.1.3. Volume and Velocity
To deal with the Volume and Velocity characteristics of Big Data, ecosystems and architectural solutions, such as lambda and kappa, have been created. Both architectures propose a structure of layers to process Big Data; the main difference between them is that lambda proposes a layer for batch data processing and another for streaming data, while kappa proposes a single layer for both batch and streaming processing. This SLR focuses on data modeling, a concept related to the Variety characteristic, which is explained next.
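To make the contrast concrete, the following is a minimal, illustrative sketch of the two architectures. All names and the word-count-style logic are hypothetical simplifications; real deployments rely on engines such as Hadoop or Spark for batch processing and Kafka or Flink for streaming.

```python
# A minimal sketch contrasting the lambda and kappa architectures.
# Class and attribute names are hypothetical simplifications.

from collections import defaultdict

class LambdaArchitecture:
    """Two layers: a batch layer periodically recomputes over all history,
    while a speed layer keeps an incremental real-time view."""

    def __init__(self):
        self.master_dataset = []                # append-only history (batch input)
        self.realtime_view = defaultdict(int)   # maintained by the speed layer

    def ingest(self, event):
        self.master_dataset.append(event)       # feeds the batch layer
        self.realtime_view[event["key"]] += 1   # speed layer updates immediately

    def batch_view(self):
        # Periodic full recomputation over the entire master dataset
        view = defaultdict(int)
        for event in self.master_dataset:
            view[event["key"]] += 1
        return view

class KappaArchitecture:
    """A single streaming layer: 'batch' processing is just replaying the
    event log through the same streaming code."""

    def __init__(self):
        self.log = []                           # replayable event log
        self.view = defaultdict(int)

    def ingest(self, event):
        self.log.append(event)
        self.view[event["key"]] += 1            # one code path for all data

    def reprocess(self):
        # Reprocessing = replaying the stream from the start of the log,
        # through the same counting logic used for live events
        self.view = defaultdict(int)
        for event in self.log:
            self.view[event["key"]] += 1
```

In lambda, the batch and speed layers duplicate the processing logic; in kappa, reprocessing is simply a replay of the log through the same streaming code, which is why kappa requires maintaining only one code path.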
1.1.4. Variety
Variety is a characteristic referring to the different types and categories of data that a Big Data repository must manage. Based on this characteristic, Big Data has been classified into structured, semi-structured and unstructured data. The next subsections explain each data type in detail.
Structured Data
In Big Data, structured data are represented in tabular form, in spreadsheets or relational databases. To deal with this type of data, widely known and well-developed technologies and techniques are used. However, according to a report by Cisco, this type of data constituted only 10% of all existing data in 2014. It is therefore very important to analyze the remaining 90% of data, corresponding to the semi-structured and unstructured data described below.
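As a minimal sketch of what "structured" means in practice, the example below stores records under a fixed schema in a relational table, using SQLite from Python's standard library; the table and column names are illustrative.

```python
# Structured data: every record conforms to the same predefined schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        sensor_id   INTEGER,
        recorded_at TEXT,
        temperature REAL
    )
""")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(1, "2024-01-01T10:00:00", 21.5),
     (1, "2024-01-01T11:00:00", 22.1)],
)

# The fixed schema makes declarative querying straightforward
query = "SELECT sensor_id, AVG(temperature) FROM readings GROUP BY sensor_id"
for row in conn.execute(query):
    print(row)
```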
Semi-Structured Data
Semi-structured data are data that do not obey a formal structure, such as a relational database model, but nevertheless present an internal organization that facilitates their processing; for instance, server logs in comma-separated values (CSV) format, documents in eXtensible Markup Language (XML) format, JavaScript Object Notation (JSON) and Binary JSON (BSON), and so forth. Some authors may consider XML and JSON to be structured.
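For illustration, the sketch below processes semi-structured JSON records: they share an internal organization (keys and nesting) but not a rigid schema. The records themselves are invented.

```python
# Semi-structured data: self-describing records whose fields may vary.
import json

raw = """
[
  {"sensor_id": 1, "temperature": 21.5},
  {"sensor_id": 2, "temperature": 19.8,
   "location": {"lat": 4.6, "lon": -74.1}}
]
"""

for record in json.loads(raw):
    # Fields may be present or absent from record to record,
    # yet the key/value organization still makes processing easy
    location = record.get("location", "unknown")
    print(record["sensor_id"], record["temperature"], location)
```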
Unstructured Data
Unstructured data are those that have either no predefined schema or no organization in their structure. This type of data includes text documents, emails, sensor data, audio files, image files, video files, data from websites, chats, electronic health records, social media data and spatio-temporal data, among others. According to Cisco, the volume of unstructured data was expected to increase up to twelvefold between 2017 and 2022.
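As a minimal sketch, the example below treats a free-text log line as unstructured data: there is no schema to query against, so any structure must be imposed by interpretation. The log line and the regular expression are illustrative.

```python
# Unstructured data: structure must be extracted, not assumed.
import re

raw_text = "Sensor 1 reported 21.5 C at 10:00; sensor 2 was offline."

# No predefined schema: we impose structure via pattern matching
for match in re.finditer(r"[Ss]ensor (\d+) reported ([\d.]+) C", raw_text):
    sensor_id, temperature = match.groups()
    print(sensor_id, temperature)
```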
To support the Variety, Volume and Velocity of Big Data, non-relational, distributed and open-source data storage systems have been created. These systems offer horizontal (linear) scalability, high availability and fault tolerance. Usually, these databases are known as NoSQL.
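For illustration, the sketch below shows a document-oriented NoSQL interaction using pymongo; it assumes a MongoDB server listening on localhost, and the database and collection names are illustrative.

```python
# A document store accommodates heterogeneous records in one collection.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["iot"]["readings"]   # hypothetical database/collection

# Documents with different fields coexist without a predefined schema
collection.insert_one({"sensor_id": 1, "temperature": 21.5})
collection.insert_one({"sensor_id": 2, "humidity": 0.63,
                       "location": {"lat": 4.6, "lon": -74.1}})

for doc in collection.find({"sensor_id": 1}):
    print(doc)
```

Unlike rows in a relational table, the two documents need not share the same fields, which is one way such systems accommodate the Variety characteristic.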