Modeling and Management of Big Data in Databases

The main real-world datasets used in the studies analyzed for this paper were sensor data, image metadata, website publications, and electronic documents. Most of the studies did not document the specific languages or tools used to model their data. However, driven by the need to analyze large volumes of data with varied structures arriving at high frequency, database research has shifted its focus from relational databases toward NoSQL. This review examines why, in light of these growing trends, a NoSQL approach may be better suited than a relational one for Big Data management.

1. Introduction

1.1. Big Data Concepts

In this section, we describe the main concepts related to Big Data in order to give the reader a general overview and a background for the terms discussed later.


1.1.1. A Brief History of Big Data

The production and processing of large volumes of data attracted researchers' interest many years ago. As early as 1944, American universities were producing estimates of the size of their libraries, which grew rapidly every year. In 1997, at the Institute of Electrical and Electronics Engineers (IEEE) Conference on Visualization, the term "Big Data" was used for the first time, in a presentation of a study on the visualization of large datasets.

Big Data has been the buzzword of recent years, that is, a fashionable expression in information systems. The general population associates the term with its literal meaning of large volumes of data. However, Big Data is a generic term that refers to large and complex datasets characterized by the combination of the well-known "Vs" of Big Data.


1.1.2. Big Data Characterization

As mentioned before, Big Data does not refer only to high volumes of data to be processed. In early Big Data studies, volume, velocity and variety were considered the fundamental characteristics, known as the three Vs of Big Data. As research advanced, new Vs, such as value and veracity, were established. Some authors currently propose up to 42 characteristics required for data to be considered Big Data, thus defining 42 Vs. For the purposes of our study, we mention only the ten Vs of Big Data presented in a previous scientific study. Table 1 summarizes each characteristic along with a brief description.

Table 1. The ten Vs of Big Data.

Characteristic    Brief Description
Volume            Large datasets
Velocity          High data generation rate
Variety           Different types of data formats
Variability       Inconsistency in the data
Viscosity         Variations in data velocity
Virality          Rate at which data spread
Veracity          Accuracy of the data
Validity          Assessment of the data's correctness
Visualization     Symbolic representation of the data
Value             Usefulness of the data for retrieving information

1.1.3. Volume and Velocity

To deal with the Volume and Velocity characteristics of Big Data, ecosystems and architectural solutions, such as lambda and kappa, have been created. Both architectures propose a structure of layers to process Big Data; the main difference between them is that lambda proposes a layer for batch data processing and another for streaming data, while kappa proposes a single layer for both batch and streaming processing. This SLR focuses on data modeling, a concept related to the Variety characteristic, which is explained next.
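
To make this difference concrete, the following minimal Python sketch (our own illustration on a toy event-counting workload; all names are hypothetical and not drawn from the reviewed studies) contrasts the two designs: lambda merges a batch view and a speed view at query time, while kappa uses a single streaming code path and handles reprocessing by replaying the event log.

    # Illustrative sketch (not a production implementation) contrasting
    # the lambda and kappa architectures on a toy event-count workload.
    from collections import Counter

    events = ["sensor", "image", "sensor", "doc", "sensor"]  # toy event log

    # Lambda: two layers whose results are merged at query time.
    def lambda_query(historical, recent):
        batch_view = Counter(historical)   # batch layer: recompute over all history
        speed_view = Counter(recent)       # speed layer: incremental view of new data
        return batch_view + speed_view     # serving layer: merge both views

    # Kappa: a single streaming layer; reprocessing = replaying the log.
    def kappa_query(log):
        view = Counter()
        for event in log:                  # one code path for batch and streaming
            view[event] += 1
        return view

    print(lambda_query(events[:3], events[3:]))  # Counter({'sensor': 3, ...})
    print(kappa_query(events))                   # same result, single pipeline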


1.1.4. Variety

Variety is a characteristic referring to the different types and formats of data that a Big Data repository must store and manage. Based on this characteristic, Big Data has been classified into structured, semi-structured and unstructured data. The following subsections explain each data type in detail.

Structured Data

In Big Data, structured data are represented in tabular form, in spreadsheets or relational databases. Widely developed and well-known technologies and techniques are used to handle this type of data. However, according to a report by Cisco, structured data constituted only 10% of all existing data in 2014. It is therefore very important to analyze the remaining 90% of data, corresponding to the semi-structured and unstructured data described below.
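
As a minimal illustration of how structured data fit the relational model, the following Python sketch (a hypothetical example, not taken from the reviewed studies) stores tabular sensor readings in an in-memory SQLite database, where a fixed schema must be declared before any data are inserted.

    import sqlite3

    # Structured data: a fixed schema is declared before inserting rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [("s1", 21.5), ("s2", 19.8), ("s1", 22.1)],
    )

    # Every row conforms to the same columns, so SQL can aggregate directly.
    for sensor_id, avg in conn.execute(
        "SELECT sensor_id, AVG(value) FROM readings GROUP BY sensor_id"
    ):
        print(sensor_id, avg)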

Semi-Structured Data

Semi-structured data are data that do not obey a formal structure, such as the relational database model. However, they present an internal organization that facilitates their processing; for instance, server logs in comma-separated values (CSV) format, or documents in eXtensible Markup Language (XML), JavaScript Object Notation (JSON) and Binary JSON (BSON) formats. Some authors may consider XML and JSON to be structured.
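
This self-describing internal organization is precisely what makes such formats easy to process programmatically. The short Python sketch below (an illustrative example with made-up field names) parses comparable records from JSON and CSV using only the standard library.

    import csv
    import io
    import json

    # Semi-structured data: no relational schema, but an internal organization
    # (field names, delimiters) that a parser can exploit.
    json_record = '{"sensor_id": "s1", "value": 21.5, "tags": ["indoor", "lab"]}'
    doc = json.loads(json_record)
    print(doc["sensor_id"], doc["tags"])   # nested fields come for free

    csv_log = "sensor_id,value\ns1,21.5\ns2,19.8\n"
    for row in csv.DictReader(io.StringIO(csv_log)):
        print(row["sensor_id"], float(row["value"]))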

Unstructured Data

Unstructured data are those that have no predefined schema or no organization in their structure. This type of data includes text documents, emails, sensor data, audio files, image files, video files, data from websites, chats, electronic health records, social media data and spatio-temporal data, among others. According to Cisco, the volume of unstructured data is expected to increase up to twelvefold between 2017 and 2022.

To support the Variety, Volume and Velocity of Big Data, non-relational, distributed and open-source data storage systems have been created. These systems provide horizontal scalability, near-linear scaling, high availability and fault tolerance. Such databases are usually known as NoSQL databases.
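
As a brief illustration of the schema flexibility these systems offer, the following sketch (assuming a locally running MongoDB instance and the pymongo driver; the database and collection names are made up) stores two documents with different shapes in the same collection, something a fixed relational schema would not allow without migration.

    from pymongo import MongoClient  # assumes: pip install pymongo, local mongod

    client = MongoClient("mongodb://localhost:27017")
    collection = client["bigdata_demo"]["readings"]  # hypothetical names

    # Documents in the same collection need not share a schema.
    collection.insert_one({"sensor_id": "s1", "value": 21.5})
    collection.insert_one({"doc_id": "d1", "text": "report", "tags": ["health"]})

    print(collection.find_one({"sensor_id": "s1"}))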