Data Modeling and Data Analytics

Introduction

We have been witnessing to an exponential growth of the volume of data produced and stored. This can be explained by the evolution of the technology that results in the proliferation of data with different formats from the most various domains (e.g. health care, banking, government or logistics) and sources (e.g. sensors, social networks or mobile devices). We have assisted a paradigm shift from simple books to sophisticated databases that keep being populated every second at an immensely fast rate. Internet and social media also highly contribute to the worsening of this situation. Facebook, for example, has an average of 4.75 billion pieces of content shared among friends every day. Traditional Relational Database Management Systems (RDBMSs) and Data Warehouses (DWs) are designed to handle a certain amount of data, typically structured, which is completely different from the reality that we are facing nowadays. Business is generating enormous quantities of data that are too big to be processed and analyzed by the traditional RDBMSs and DWs technologies, which are struggling to meet the performance and scalability requirements.

Therefore, in the recent years, a new approach that aims to mitigate these limitations has emerged. Companies like Facebook, Google, Yahoo and Amazon are the pioneers in creating solutions to deal with these "Big Data" scenarios, namely recurring to technologies like Hadoop and MapReduce. Big Data is a generic term used to refer to massive and complex datasets, which are made of a variety of data structures (structured, semi- structured and unstructured data) from a multitude of sources. Big Data can be characterized by three Vs: volume (amount of data), velocity (speed of data in and out) and variety (kinds of data types and sources). Still, there are added some other Vs for variability, veracity and value.

Adopting Big Data-based technologies not only mitigates the problems presented above, but also opens new perspectives that allow extracting value from Big Data. Big Data-based technologies are being applied with success in multiple scenarios like in: (1) e-commerce and marketing, where count the clicks that the crowds do on the web allow identifying trends that improve campaigns, evaluate personal profiles of a user, so that the content shown is the one he will most likely enjoy; (2) government and public health, allowing the detection and tracking of disease outbreaks via social media or detect frauds; (3) transportation, industry and surveillance, with real-time improved estimated times of arrival and smart use of resources.

This paper provides a broad view of the current state of this area based on two dimensions or perspectives: Data Modeling and Data Analytics. Table 1 summarizes the focus of this paper, namely by identifying three representative approaches considered to explain the evolution of Data Modeling and Data Analytics. These approaches are: Operational databases, Decision Support databases and Big Data technologies.

This research work has been conducted in the scope of the DataStorm project, led by our research group, which focuses on addressing the design, implementation and operation of the current problems with Big Data- based applications. More specifically, the goal of our team in this project is to identify the main concepts and patterns that characterize such applications, in order to define and apply suitable domain-specific languages (DSLs). Then these DSLs will be used in a Model-Driven Engineering (MDE) approach aiming to ease the design, implementation and operation of such data-intensive applications.

To ease the explanation and better support the discussion throughout the paper, we use a very simple case study based on a fictions academic management system described below:

Case Study - Academic Management System (AMS):

The Academic Management System (AMS) should support two types of end-users: students and professors. Each person has a name, gender, date of birth, ID card, place of origin and country. Students are enrolled in a given academic program, which is composed of many courses. Professors have an academic degree, are associated to a given department and lecture one or more courses. Each course has a name, academic term and can have one or more locations and academic programs associated. Additionally, a course is associated to a schedule composed of many class periods determining its duration and the day it occurs.

The outline of this paper is as follows: Section 2 describes Data Modeling and some representative types of data models used in operational databases, decision support databases and Big Data technologies. Section 3 details the type of operations performed in terms of Data Analytics for these three approaches. Section 4 compares and discusses each approach in terms of the Data Modeling and Data Analytics perspectives. Section 5 discusses our research in comparison with the related work. Finally, Section 6 concludes the paper by summarizing its key points and identifying future work.