Data Modeling and Data Analytics
Introduction
We have been witnessing exponential growth in
the volume of data produced and stored. This can be explained by the
evolution of the technology that results in the proliferation of data
with different formats from the most various domains (e.g. health care,
banking, government or logistics) and sources (e.g. sensors, social
networks or mobile devices). We have witnessed a paradigm shift from
simple books to sophisticated databases that are populated every
second at an immense rate. The Internet and social media also contribute
heavily to this situation. Facebook, for
example, has an average of 4.75 billion pieces of content shared among
friends every day. Traditional Relational Database Management
Systems (RDBMSs) and Data Warehouses (DWs) are designed to handle a
certain amount of data, typically structured, which is completely
different from the reality that we face nowadays. Businesses are
generating enormous quantities of data that are too big to be processed
and analyzed by traditional RDBMS and DW technologies, which
struggle to meet the performance and scalability requirements.
Therefore,
in recent years, a new approach that aims to mitigate these
limitations has emerged. Companies like Facebook, Google, Yahoo and
Amazon are the pioneers in creating solutions to deal with these "Big
Data" scenarios, namely by resorting to technologies like Hadoop
and MapReduce. Big Data is a generic term used to refer to massive
and complex datasets, which are made of a variety of data structures
(structured, semi-structured and unstructured data) from a multitude of
sources. Big Data can be characterized by three Vs: volume (amount
of data), velocity (speed of data in and out) and variety (kinds of
data types and sources). Other Vs have also been added, namely
variability, veracity and value.
Adopting Big Data-based
technologies not only mitigates the problems presented above, but also
opens new perspectives that allow extracting value from Big Data. Big
Data-based technologies are being applied successfully in multiple
scenarios, such as: (1) e-commerce and marketing, where
counting the clicks that crowds make on the web allows identifying trends
that improve campaigns and evaluating a user's personal profile, so that
the content shown is the one the user will most likely enjoy; (2) government
and public health, allowing the detection and tracking of disease
outbreaks via social media or the detection of fraud; (3) transportation,
industry and surveillance, with improved real-time estimates of arrival
times and smart use of resources.
This paper provides a broad
view of the current state of this area based on two dimensions or
perspectives: Data Modeling and Data Analytics. Table 1 summarizes the
focus of this paper, namely by identifying three representative
approaches considered to explain the evolution of Data Modeling and Data
Analytics. These approaches are: Operational databases, Decision
Support databases and Big Data technologies.
This research work
has been conducted in the scope of the DataStorm project, led by
our research group, which focuses on addressing current problems in the
design, implementation and operation of Big Data-based
applications. More specifically, the goal of our team in this
project is to identify the main concepts and patterns that characterize
such applications, in order to define and apply suitable domain-specific
languages (DSLs). These DSLs will then be used in a Model-Driven
Engineering (MDE) approach aiming to ease the design,
implementation and operation of such data-intensive applications.
To
ease the explanation and better support the discussion throughout the
paper, we use a very simple case study based on a fictitious academic
management system described below:
Case Study - Academic Management System (AMS):
The Academic Management System (AMS) should support two types of end-users: students and professors. Each person has a name, gender, date of birth, ID card, place of origin and country. Students are enrolled in a given academic program, which is composed of many courses. Professors have an academic degree, are associated with a given department and lecture one or more courses. Each course has a name and an academic term, and can have one or more locations and academic programs associated with it. Additionally, a course is associated with a schedule composed of many class periods, each determining its duration and the day on which it occurs.
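To make the case study more concrete, the entities and associations described above can be sketched as plain Python dataclasses. This is only an illustrative reading of the textual description, not the model used later in the paper: the class names, field names, the `weekly_minutes` helper and all sample values are assumptions introduced here, and the course-to-program association is stored on the program side only.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class Person:
    # Attributes shared by both kinds of end-users (students and professors).
    name: str
    gender: str
    date_of_birth: date
    id_card: str
    place_of_origin: str
    country: str


@dataclass
class ClassPeriod:
    # One entry of a course schedule: the day it occurs and its duration.
    day: str
    duration_minutes: int


@dataclass
class Course:
    name: str
    academic_term: str
    locations: list[str] = field(default_factory=list)
    schedule: list[ClassPeriod] = field(default_factory=list)


@dataclass
class AcademicProgram:
    # A program is composed of many courses; a course may appear in
    # several programs because the same Course object can be shared.
    name: str
    courses: list[Course] = field(default_factory=list)


@dataclass
class Student(Person):
    program: AcademicProgram | None = None


@dataclass
class Professor(Person):
    academic_degree: str = ""
    department: str = ""
    courses: list[Course] = field(default_factory=list)


def weekly_minutes(course: Course) -> int:
    """Hypothetical helper: total scheduled time of a course per week."""
    return sum(period.duration_minutes for period in course.schedule)


# Sample data (entirely made up for illustration).
databases = Course(
    name="Databases",
    academic_term="Fall",
    locations=["Room 1"],
    schedule=[ClassPeriod("Monday", 90), ClassPeriod("Wednesday", 60)],
)
program = AcademicProgram("Computer Science", courses=[databases])
student = Student(
    "Alice", "F", date(2000, 1, 1), "ID-001", "Lisbon", "Portugal",
    program=program,
)
professor = Professor(
    "Bob", "M", date(1975, 5, 20), "ID-002", "Porto", "Portugal",
    academic_degree="PhD", department="Informatics", courses=[databases],
)
```

In this sketch, a student reaches their courses through the enrolled program (`student.program.courses`), while a professor references the lectured courses directly. Other readings of the description are equally valid, for example modeling the department or the location as separate entities.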
The outline of this paper is
as follows: Section 2 describes Data Modeling and some representative
types of data models used in operational databases, decision support
databases and Big Data technologies. Section 3 details the type of
operations performed in terms of Data Analytics for these three
approaches. Section 4 compares and discusses each approach in terms of
the Data Modeling and Data Analytics perspectives. Section 5 discusses
our research in comparison with the related work. Finally, Section 6
concludes the paper by summarizing its key points and identifying future
work.