Analyzing Big Data can create significant advantages for an organization because it enables the discovery of patterns and correlations in datasets. This paper discusses the state of Big Data management with a particular focus on data modeling and data analytics.
Discussion
In this section we compare and discuss the
approaches presented in the previous sections in terms of the two
perspectives that guide this survey: Data Modeling and Data Analytics.
Each perspective defines a set of features used to compare Operational
Databases, DWs and Big Data approaches among themselves.
Regarding
the Data Modeling Perspective, Table 2 considers the following features
of analysis: (1) the data model; (2) the abstraction level in which the
data model resides, according to the abstraction levels (Conceptual,
Logical and Physical) of the database design process; (3) the concepts
or constructs that compose the data model; (4)
the concrete languages used to produce the data models and that apply
the previous concepts; (5) the modeling tools that allow specifying
diagrams using those languages and; (6) the database tools that support
the data model. Table 2 presents the values of each feature for each
approach. It is possible to verify that the majority of the data models
are at a logical and physical level, with the exception of the ER model
and the OLAP cube model, which are more abstract and defined at
conceptual and logical levels. It is also possible to verify that Big
Data has more data models than the other approaches, what can explain
the work and proposals that have been conducted over the last years, as
well as the absence of a de facto data model. In terms of concepts,
again Big Data-related data models have a more variety of concepts than
the other approaches, ranging from key-value pairs or documents to nodes
and edges. Concerning concrete languages, it is concluded that every
data model presented in this survey is supported by a SQL-DDL-like
language. However, we found that only the operational databases and DWs
have concrete languages to express their data models in a graphical way,
like Chen's notation for ER model, UML Data Profile for Relational
model or CWM for multidimensional DW models. Also, related to that
point, there are none modeling tools to express Big Data models. Thus,
defining such a modeling language and respective supporting tool for Big
Data models constitute an interesting research direction that fills
this lack. At last, all approaches have database tools that support the
development based on their data models, with the exception of the ER
model that is not directly used by DBMSs.
Table 2. Comparison of the approaches from the Data Modeling Perspective.
Approaches Features |
Data Model |
Abstraction Level |
Concepts |
Concrete Languages |
Modeling Tools |
DB Tools Support |
Operational |
Entity- Relationship Model |
Conceptual, Logical |
Entity Relationship Attribute Primary Key Foreign Key |
Chen's, Crow's foot, Bachman's, Barker's, IDEF1X |
Sparx Enterprise Architect, Visual Paradigm, Oracle Designer, MySQL Workbench, ER/Studio |
|
Relational Model |
Logical, Physical |
Table Row Attribute Primary Key Foreign Key, View, Index |
SQL-DDL, UML Data Profile |
Sparx Enterprise Architect, Visual Paradigm, Oracle Designer, MySQL Workbench, ER/Studio |
Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2 |
|
Decision Support |
OLAP Cube |
Conceptual, Logical |
Dimensions, Levels, Cube faces, Time dimension, Local dimension |
Common Warehouse Metamodel |
Essbase Studio Tool, Enterprise Architect, Visual Paradigm |
Oracle Warehouse Builder, Essbase Studio Tool, Microsoft Analysis Services |
Star Schema |
Logical, Physical |
Fact table, Attributes table, Dimensions, Foreign Key |
SQL-DDL, DML, UML Data Model Profile, UML Profile for Modeling Data Warehouse Usage |
Enterprise Architect, Visual Paradigm, Oracle SQL Data Modeler |
Microsoft SQL Server, Oracle, MySQL, PostgreSQL, IBM DB2 |
|
Big Data |
Key-Value |
Logical, Physical |
Key, Value |
SQL-DDL, Dynamo Query Language |
|
Dynamo, Voldemort |
Document |
Logical, Physical |
Document, Primary Key |
SQL-DDL, Javascript |
|
MongoDB, CounchDB |
|
Wide-Column |
Logical, Physical |
Keyspace, Table, Column, Column Family, Super Column, Primary Key, Index |
CQL, Groovy |
|
Cassandra, HBase |
|
Graph |
Logical, Physical |
Node, Edge, Property |
Cypher Query Language, SPARQL |
|
Neo4j, AllegroGraph |
On the other hand, in
terms of the Data Analytics Perspective, Table 3 considers six features
of analysis: (1) the class of application domains, which characterizes
the approach suitability; (2) the common operations used in the
approach, which can be reads and/or writes; (3) the operations types
most typically used in the approach; (4) the concrete languages used to
specify those operations; (5) the abstraction level of these concrete
languages (Conceptual, Logical and Physical); and (6) the technology
support of these languages and operations.
Table 3 shows that Big
Data is used in more classes of application domains than the
operational databases and DWs, which are used for OLTP and OLAP domains,
respectively. It is also possible to observe that operational databases
are commonly used for reads and writes of small operations (using
transactions), because they need to handle fresh and critical data in a
daily basis. On the other hand, DWs are mostly suited for read
operations, since they perform analysis and data mining mostly with
historical data. Big Data performs both reads and writes, but in a
different way and at a different scale from the other approaches. Big
Data applications are built to perform a huge amount of reads, and if a
huge amount of writes is needed, like for OLTP, they sacrifice
consistency (using "eventually consistency") in order to achieve great
availability and horizontal scalability. Operational databases support
their data manipulation operations (e.g. select, insert or delete) using
SQL-ML, which has slight variations according to the technology used.
DWs also use SQL-ML through the select statement, because their
operations (e.g. slice, dice or drill down/up) are mostly reads. DWs
also use SQL-based languages, like MDX and XMLA (XML for Analysis),
for specifying their operations. On the other hand, regarding Big Data
technologies, there is a great variety of languages to manipulate data
according to the different class application domains. All of these
languages provide equivalent operations to the ones offered by SQL-ML
and add new constructs for supporting both ETL, data stream processing
(e.g. create stream, window) [34] and MapReduce operations. It is
important to note that concrete languages used in the different
approaches reside at logical and physical levels, because they are
directly used by the supporting software tools.