Modeling and Management of Big Data in Databases
3. Results
3.3. Discussion
In this section, we answer RQ3. First, we refer to the relevant information obtained from the bibliometric analysis, which confirms the growth of this research topic. Since 2015, the number of studies has increased roughly a hundredfold, with 2018 being the year with the most publications. Scopus is the source holding the highest number of relevant studies. The countries with the greatest contribution to the topic of Big Data modeling are the USA and China, and more than 50% of the studies are funded.
These data suggest sustained interest in the topic within the research community. Although much research addresses the analytics side of Big Data, the engineering that handles the platforms, the extraction techniques, and the loading and transformation of data until it reaches the required storage is no less important, since these activities constitute the starting point that analytics processes require in order to produce value from the data.
Next, we use the information obtained from the SLR as a reference to identify the trends and gaps in the topic of interest.
3.3.1. Trends
To summarize the observed trends, Figure 17 serves as a reference to present the three main concepts on which this analysis focuses: source, modeling and database.
Figure 17. Trends for Big Data Modeling and Management.
Source
For the Source aspect, the most researched data sources are unstructured data, specifically website information.
Modeling
For the Modeling aspect, seven main trends were identified within the analyzed corpus:
- Most studies present their proposals at the conceptual, logical and physical abstraction levels;
- The ER model is the most widely used in approaches at the conceptual abstraction level, followed by the multidimensional model and, thirdly, XML;
- At the logical abstraction level, the most researched model is document-oriented, followed by column-oriented and graph-oriented;
- At the physical abstraction level, implementations focus mostly on the MongoDB DBMS, followed by Neo4j and Cassandra. Thus, the following de facto standards have emerged: MongoDB for the document-oriented data model, Cassandra for the column-oriented model and Neo4j for the graph data model. These data are supported by statistical information from the DB-Engines Ranking (Trend Popularity) of the Solid IT company as of December 2019;
- The most proposed modeling methodology is data-driven;
- There is no clear tendency towards a particular data modeling approach, but there are proposals using UML and XML;
- No data modeling tool stands out as a trend, although some studies use EMF;
- Regarding the different fields of application, ER is commonly used at the conceptual abstraction level across the case studies. At the logical level, approaches are proposed for migrating from relational to document-oriented, graph and column-oriented models. Specifically, solutions based on the graph model are proposed for spatio-temporal and transmission data, while solutions for all NoSQL data models are suggested for XML and JSON formats from web servers.
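The relational-to-document migration mentioned above typically rests on mapping rules. As a minimal sketch (a hypothetical illustration with invented table names, not taken from any of the reviewed studies), one common rule turns each parent-table row into a document and embeds the matching rows of a child table as a nested array:

```python
# Hypothetical sketch: migrating relational rows to a document model
# by embedding child-table rows (a common one-to-many mapping rule).

def rows_to_documents(parent_rows, child_rows, fk, embed_as):
    """Turn each parent row into a document, embedding the child rows
    whose foreign key (fk) matches the parent's primary key 'id'."""
    docs = []
    for parent in parent_rows:
        doc = dict(parent)  # copy scalar attributes as document fields
        doc[embed_as] = [
            {k: v for k, v in child.items() if k != fk}  # drop the FK: it is implicit in the nesting
            for child in child_rows
            if child[fk] == parent["id"]
        ]
        docs.append(doc)
    return docs

# Example with invented 'customers' (parent) and 'orders' (child) tables.
customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Luis"}]
orders = [
    {"id": 10, "customer_id": 1, "total": 99.5},
    {"id": 11, "customer_id": 1, "total": 15.0},
]

docs = rows_to_documents(customers, orders, fk="customer_id", embed_as="orders")
# Each customer document now embeds that customer's orders.
```

The resulting documents could then be loaded into a document store such as MongoDB; the design choice of embedding versus referencing is precisely where the reviewed approaches diverge.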
Database
For the Database aspect, 55.55% of the studies focus on homogeneous database types, so no trend towards a polyglot persistence system is observed.
3.3.2. Gaps
Figure 18 also allowed us to identify three main concepts for analyzing the gaps: source, modeling and database.
Figure 18. Gaps for Big Data Modeling and Management.
Source
For the Source aspect, the following main gaps were identified within the analyzed corpus:
- One of the main objectives of Big Data is to use the data to generate value; this value depends on the needs of the business. For this reason, it is very important that studies are validated with real use cases, since only with real datasets can it be verified whether value is actually being generated. As a result of our SLR, only ten studies (27.78%) use real datasets; 16.67% of these works present their case studies with data from websites, 5.56% use sensor data and 2.78% involve data from electronic documents and image metadata;
- Additionally, the other main Big Data features must be guaranteed, such as volume, velocity and veracity. According to the results in Table 6, not all of these features are justified in the approaches;
- To comply with variety, studies should consider that data can come in any format: structured, semi-structured or unstructured. Therefore, the proposed approaches should address all of these types. Only 2.78% have proposed a solution for all three types of data. Of the remaining studies, 47.22% only present approaches for unstructured data, 30.56% only for structured data, 13.89% for semi-structured data, 2.78% combine structured and semi-structured data and 2.78% do not specify any type.
Modeling
For the Modeling aspect, the main gaps found in our study concern standardization and data modeling methodology. Standards are considered important since they guarantee a universal and uniform interpretation, readability of the data model, and portability between database engines, platforms and applications, among other benefits for project managers, analysts, designers, developers and end-users of the databases. In summary, the results are as follows:
- There is no standardization regarding the definition of mapping rules for the transformation between models at the conceptual, logical and physical data abstraction levels. Thus, the 36 studies propose different approaches for the transformation, making it difficult for users to choose the most appropriate one;
- No NoSQL system has emerged as a standard or as a de facto standard yet;
- There is no clearly defined use of a standardized language or method for modeling data at the logical and physical levels. According to the results of our SLR, only at the conceptual level can ER be considered a trend, perhaps because the conceptual level is technology-agnostic;
- As mentioned in the Modeling Language subsection, for NoSQL databases the new paradigm for the modeling process is query-driven. However, only five studies focus on this methodology;
- Big Data analysis requires the implementation of efficient systems. In the data-driven methodology, the models are designed before assessing which queries will be performed. The limitation of this approach is that all data are stored, when only a limited fraction is needed to answer the required queries. In contrast, in query-driven methodologies, a set of queries must be expressed, evaluated and integrated before modeling, so that the design can focus on answering the necessary queries. For solutions using real-time data, data-driven models risk being impractical because of the diverse nature of data streams;
- One study demonstrates that, in terms of time cost, processing data with a query-driven approach requires less time than with a data-driven one. This, in turn, leads to lower energy consumption and, therefore, a longer equipment lifetime;
- According to another study, the use of a query-driven methodology allowed users to formulate queries in a natural user language focused on relevant regions of interest;
Techniques such as reverse engineering are not taken into account, so this aspect could complement the studies. This method is widely used with relational databases in projects that work with existing databases for which no documented models exist (or the existing models risk being outdated).
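The data-driven/query-driven contrast discussed above can be illustrated with a toy sketch (our own hypothetical example, with invented field names, not drawn from the reviewed studies): a query-driven design derives the stored attribute set from the declared query workload, whereas a data-driven design stores every field present in the source data up front:

```python
# Hypothetical sketch contrasting data-driven and query-driven design:
# the query-driven schema keeps only attributes the workload touches.

def data_driven_schema(sample_record):
    """Data-driven: model everything present in the source data."""
    return set(sample_record)

def query_driven_schema(queries):
    """Query-driven: keep only attributes referenced by the workload."""
    needed = set()
    for query in queries:
        needed |= set(query["filters"]) | set(query["projections"])
    return needed

# Invented sample record and query workload for illustration.
record = {"id": 1, "name": "x", "ts": 0, "temp": 21.5, "raw_payload": "..."}
workload = [
    {"filters": ["ts"], "projections": ["temp"]},
    {"filters": ["id"], "projections": ["temp", "ts"]},
]

print(sorted(data_driven_schema(record)))    # every field would be stored
print(sorted(query_driven_schema(workload)))  # only the fields the queries need
```

The gap in stored volume between the two schemas is what drives the time-cost and energy differences reported by the studies cited above.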
Database
For the Database aspect, the main gap found is related to the variety characteristic of Big Data. To manage this feature correctly, NoSQL databases must be targeted towards a polyglot persistence system (PPS). A PPS can be obtained through storage in hybrid databases; that is, systems that support different storage technologies. Only eight studies have proposed solutions oriented to hybrid databases and, of these, only one demonstrated its approach with a real case that models and manages structured, semi-structured and unstructured data and stores them in different databases.