This paper explores what a Database Management System (DBMS) suited to the future might look like, based on issues visible today and on emerging trends, and considers how such a system might be built. An apt example is a system that allows efficient, continuous querying and mining of data flows and that can be deployed on media with different computing capacities. Which forms of human-to-machine communication and interoperability do you think would be most beneficial? Consider, for example, how an individual embedded medical device might be included in a DBMS as processes become more complex and storage facilities become more distributed. What are some key aspects of DBMS that could benefit future architectures?
3. Intensive Big Data Management: Bringing Scalability to Data Management
A consistent challenge is that real Big Data clusters involve multi-user, multi-class (i.e., mixed job size) workloads: not just one analysis job at a time, or a few, but a mix of jobs and queries of widely varying importance and size. The next generation of Big Data management services will have to propose approaches for dealing with such workloads effectively, a long-standing open problem in both the parallel database world and the Hadoop world. We believe that five aspects are key for dealing with intensive Big Data management: (i) simple storage systems that focus on efficient read and write operations; (ii) multiple storage spaces, given the volume and variety of data collections; (iii) data analytics workflows supported by efficient underlying data management infrastructures; (iv) parallel programming models to deal with the execution of greedy data analytic tasks that may require significant computing and memory resources; (v) cloud providers that can supply storage, computing, and memory resources in an elastic and "transparent" way, and that can enable the execution of greedy processes running on top of data collections hosted on their storage services.
New kinds of data with specific structures (e.g., documents, graphs), produced by sensors, Global Positioning Systems (GPS), automated trackers, and monitoring systems, have to be manipulated, analyzed, and archived. These large volumes of data impose new challenges and opportunities around storage, analysis, and archival. NoSQL stores appear to be appropriate systems, claiming to be simpler, faster, and more reliable. This means that traditional data management techniques built around upfront schema definition and relational references are being questioned.
The use of heterogeneous data stores within a single system is gradually becoming common practice in application development. Modern applications tend to rely on a polyglot approach to persistence, in which traditional databases, non-relational data stores, and the scalable systems associated with the emerging NewSQL movement are used simultaneously.
Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis of small samples. Big Data are often noisy, dynamic, heterogeneous, interrelated, and untrustworthy. Nevertheless, even noisy Big Data can be more valuable than tiny samples: general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge.
NoSQL Data Store Managers
Even if there is no standard definition of what NoSQL means, these systems share common characteristics: (i) they do not rely on the relational model and do not use the SQL language; (ii) they tend to run on cluster architectures; (iii) they do not have a fixed schema, allowing any record to store arbitrary data. The systems that fall under the NoSQL umbrella are quite varied, each with its own set of features and value propositions. Examples include MongoDB, CouchDB, Cassandra, HBase, as well as BigTable and SimpleDB, which share the same general operating characteristics.
Despite the profound differences among NoSQL systems, their common architectural characteristic is that the external and logical levels of the RDBMS disappear. This means that applications sit close to the physical level, with very little program-data independence. Data processing functions such as querying, aggregation, and analysis are designed for efficiency. For example, Google's Bigtable adopts a column-oriented data model that avoids wasting space on nulls by simply not storing a column when no value exists for it. Columns can hold any data type, as long as the data can be persisted as an array of bytes. The sorted, ordered structure makes seeks by row key extremely efficient: data access is less random and ad hoc, and a lookup is as simple as finding the node in the sequence that holds the data. Data are inserted at the end of the list.
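The two ideas, sparse column storage and sorted row keys, can be illustrated with a small toy sketch (this is not Bigtable's implementation; all class, method, and key names are illustrative only):

```python
import bisect

class SortedColumnStore:
    """Toy column-oriented store: rows are kept sorted by row key, and a row
    only materializes the columns that actually have values, so nulls consume
    no space at all."""

    def __init__(self):
        self._row_keys = []   # sorted list of row keys
        self._rows = {}       # row key -> {column name: bytes}

    def put(self, row_key, column, value: bytes):
        if row_key not in self._rows:
            bisect.insort(self._row_keys, row_key)   # keep row keys ordered
            self._rows[row_key] = {}
        self._rows[row_key][column] = value          # store only columns that exist

    def get(self, row_key, column):
        return self._rows.get(row_key, {}).get(column)  # None if the cell was never written

    def scan(self, start_key, end_key):
        """Range scan made efficient by the sorted row-key order."""
        lo = bisect.bisect_left(self._row_keys, start_key)
        hi = bisect.bisect_right(self._row_keys, end_key)
        for key in self._row_keys[lo:hi]:
            yield key, self._rows[key]

store = SortedColumnStore()
store.put("user#042", "name", b"Ada")
store.put("user#042", "email", b"ada@example.org")   # no other columns stored for this row
store.put("user#007", "name", b"Alan")
print(list(store.scan("user#000", "user#100")))
```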
Another sign of the proximity to the physical model exploited by NoSQL systems is the key-value store, which uses a hash map (associative array) to hold key-value pairs. The structure is popular because it gives these stores O(1) average running time for data access. The key of a key-value pair is unique in the set and can be looked up cheaply to access the data. Key-value stores come in several flavors: some keep the data in memory, while others can also persist the data to disk. The underlying storage architecture is in general a cluster, and the execution model of data processing functions is Map-Reduce. Data are, in general, managed in a cache, for example using the memcached protocol, which is particularly popular in key-value stores. A cache provides an in-memory snapshot of the most-used data in an application; its purpose is to reduce disk I/O.
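A toy sketch of such a store, with an O(1) in-memory hash map and on-demand persistence (all names are illustrative and no particular product's API is implied), might look like this:

```python
import json
import os

class TinyKVStore:
    """Toy key-value store: an in-memory hash map gives O(1) average get/put,
    and persistence to disk happens only on demand."""

    def __init__(self, path="kv_snapshot.json"):
        self._data = {}          # hash map (associative array): key -> value
        self._path = path

    def put(self, key, value):
        self._data[key] = value  # O(1) average insertion

    def get(self, key, default=None):
        return self._data.get(key, default)  # O(1) average lookup

    def flush(self):
        """On-demand persistence of the in-memory snapshot."""
        with open(self._path, "w") as f:
            json.dump(self._data, f)

    def load(self):
        if os.path.exists(self._path):
            with open(self._path) as f:
                self._data = json.load(f)

cache = TinyKVStore()
cache.put("session:42", {"user": "ada", "cart": ["book"]})
print(cache.get("session:42"))
cache.flush()   # write the snapshot only when asked to
```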
In some situations, availability cannot be compromised, and the system is so distributed that partition tolerance is required. In such cases, it may be acceptable to give up strong consistency. The counterpart of strong consistency is weak consistency. Inconsistent data are probably not an option for any serious system that allows updates, but eventual consistency can be. Eventual consistency means that, after an update, all nodes in the cluster eventually see the same state. If this "eventuality" can be bounded within certain limits, the eventual consistency model can work. The term BASE (Basically Available, Soft state, Eventual consistency) denotes this case.
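Eventual consistency can be illustrated with a deliberately simplified simulation (this is not how any particular store implements it): an update first lands on a single replica, and background merge rounds make all replicas converge, here using last-write-wins on a logical version number.

```python
class Replica:
    """One node holding a versioned copy of a single record."""

    def __init__(self, name):
        self.name = name
        self.value, self.version = None, 0

    def local_write(self, value, version):
        self.value, self.version = value, version

    def merge(self, other):
        # Anti-entropy step: adopt the peer's state if it is newer.
        if other.version > self.version:
            self.value, self.version = other.value, other.version

replicas = [Replica("A"), Replica("B"), Replica("C")]
replicas[0].local_write("v1", version=1)          # the update lands on one node only
print([r.value for r in replicas])                # ['v1', None, None] -> temporarily inconsistent

# Background gossip rounds eventually propagate the update to every node.
for _ in range(2):
    for r in replicas:
        for peer in replicas:
            r.merge(peer)
print([r.value for r in replicas])                # ['v1', 'v1', 'v1'] -> eventually consistent
```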
NoSQL systems promote performant, scalable, cluster-oriented data management and schema-less data design, focusing on data distribution, duplication, and on-demand persistence. The logical and external levels of the classic DBMS architecture are erased, exposing the physical level to applications with only limited transparency. The application describes the data structures it wants to persist, and these can be retrieved using indexing mechanisms well adapted to those structures. Assuming sufficient memory resources, these systems promote parallel data querying, including data processing tasks. Data availability is ensured through replication, and persistence to secondary memory is done on demand. For the time being, given that data management is done at the physical level and that there is little data-program independence, a heavy programming burden falls on application programmers. Most NoSQL systems do not support high-level query languages with built-in query optimization. Instead, they expect the application programmer to optimize the execution of data access calls, with joins and similar operations having to be implemented in the application. Another drawback, mainly associated with this historical moment, is that application programming interfaces (APIs) are not yet standardized; standardized bindings are therefore missing and have to be programmed and maintained by hand.
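As a minimal illustration of this burden, the following sketch performs a hash join entirely in application code over two in-memory collections standing in for documents fetched from a store; the collection contents and field names are hypothetical.

```python
# A hand-written hash join between two "collections", here plain lists of
# dictionaries standing in for documents fetched from a schema-less store.
orders = [
    {"order_id": 1, "customer_id": "c1", "total": 30},
    {"order_id": 2, "customer_id": "c2", "total": 55},
]
customers = [
    {"customer_id": "c1", "name": "Ada"},
    {"customer_id": "c2", "name": "Alan"},
]

# Build phase: index the smaller side by the join key.
by_id = {c["customer_id"]: c for c in customers}

# Probe phase: look up each order's customer and assemble the joined record.
joined = [
    {**o, "customer_name": by_id[o["customer_id"]]["name"]}
    for o in orders
    if o["customer_id"] in by_id
]
print(joined)
```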
NoSQL systems overcome some of the shortcomings of relational systems, but they leave aside good RDBMS principles that go beyond the relational model and the SQL language. The schema-less approach seems to respond to a schema evolution requirement stemming from applications that deal with data in a very simple way (read/write operations). Web data, such as activity logs from an e-commerce site or the data managed by social media applications like Facebook, are examples of cases needing schema evolution, because the data are not very structured and, even when they are, the structure changes a lot over time.
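As a hedged sketch of this kind of schema evolution, the snippet below uses MongoDB's Python driver (pymongo); it assumes a locally running MongoDB instance, and the database, collection, and field names are purely illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB instance
events = client["shop"]["events"]                  # hypothetical database/collection names

# An early version of the application logs only a type and a timestamp.
events.insert_one({"type": "page_view", "ts": "2024-05-01T10:00:00Z"})

# A later version adds new fields; no ALTER TABLE or migration step is needed,
# the old and new document shapes simply coexist in the same collection.
events.insert_one({
    "type": "page_view",
    "ts": "2024-05-02T09:30:00Z",
    "user_agent": "Mozilla/5.0",
    "referrer": "https://example.org",
})

# The burden of handling both shapes falls on the application code.
for doc in events.find({"type": "page_view"}):
    print(doc.get("referrer", "<unknown>"))
```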
Dealing with Multiple Storage Spaces
As part of the emerging polyglot persistence movement, the simultaneous use of multiple scalable SQL, NoSQL, and NewSQL data stores within a single system is gradually becoming common practice in modern application development. Nonetheless, the combination of heterogeneous databases, flexible schemata, and non-standard APIs adds complexity for application developers. For example, since the schemata used by these applications are spread across multiple data stores, each of which may rely on a distinct data model (key-value, document, graph, etc.), developers must be familiar with a large number of implementation details in order to work with, and maintain, the overall database model effectively.
Due to the nature of schema-less data stores, developers also need to maintain the implicit schemata that these applications rely upon. The source code generally contains assumptions about the data structures used by the application (field names, types, etc.), even if the data stores themselves do not enforce any particular schema; essentially, the schemata are shifted from the database to the application source code. This happens because many Web applications need to deal with persistent objects that maintain data about their execution state. However, having the data schemata as part of the application code can lead to maintenance and performance issues. For instance, developers have to manually analyze the full source code in order to understand the data structures these applications use, an error-prone activity given the combination of different programming styles, APIs, and development environments.
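The following fragment is a hypothetical illustration (not taken from any analyzed application) of such an implicit schema: the field names and type assumptions exist only in the application class, which is exactly the kind of information a source code analyzer has to recover.

```python
# The "schema" lives only in application code: these hard-coded field names
# and type assumptions are what a schema discovery tool tries to recover.
class UserProfile:
    COLLECTION = "profiles"          # assumed collection name

    def __init__(self, doc: dict):
        # Implicit schema: the code assumes these fields exist with these types.
        self.username: str = doc["username"]
        self.followers: int = doc.get("followers", 0)
        self.tags: list = doc.get("tags", [])

    def to_doc(self) -> dict:
        return {"username": self.username,
                "followers": self.followers,
                "tags": self.tags}

profile = UserProfile({"username": "ada", "followers": 3})
print(profile.to_doc())
```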
Together with other approaches in the domain and in industry, we propose an approach and tool named ExSchema that enables the automatic discovery of schemata from polyglot persistence applications. The discovery mechanism is based on source code analysis techniques, in particular on API usage and on the analysis of standard data layer patterns. The tool receives application source code as input, containing invocations to the APIs of one or more data stores (graph, key-value, relational, and column). The ExSchema analyzers examine this source code and share their analysis results with each other; for example, in order to identify schema update methods, we first identify the declaration of variables. The schema fragments recovered by the code analyzers are grouped together according to the data models supported by the tool, extending the translation mechanisms with the identification of relationships between graph entities. The discovered schemata are represented using a set of meta-layer constructs, and finally this meta-layer representation is transformed into a PDF image and a set of Spring Roo scripts. This means that if, for example, the analyzed application relies on graph and document data stores, the tool generates two schemata, one per data model; both are depicted in a single PDF image, and two Spring Roo scripts are generated, one for each schema. Another example of data collection schema inference can be found in current Spark SQL functions: schemata of DataFrames and Datasets can be inferred, extracted, or explicitly defined using built-in operations.
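For the Spark SQL case, a short PySpark sketch of these options (inference, explicit definition, and programmatic extraction) might look as follows; the input file name is a placeholder.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("schema-inference").getOrCreate()

# 1. Inferred schema: Spark samples the JSON records and derives field names/types.
inferred_df = spark.read.json("events.json")      # placeholder input file
inferred_df.printSchema()                          # shows the schema Spark inferred

# 2. Explicit schema: the structure is declared up front instead of inferred.
explicit_schema = StructType([
    StructField("type", StringType(), nullable=False),
    StructField("ts", StringType(), nullable=True),
    StructField("user_id", LongType(), nullable=True),
])
explicit_df = spark.read.schema(explicit_schema).json("events.json")

# 3. The schema can also be extracted programmatically from an existing DataFrame.
print(explicit_df.schema.json())
```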
Data Analytics
Big Data forces us to view data mathematically first (e.g., measures, value distributions) and to establish a context for it later. For instance, how can researchers use statistical tools and computer technologies to identify meaningful patterns of information? How should significant data correlations be interpreted? What is the role of traditional forms of scientific theorizing and analytic models in assessing data? What one really wants is to look at the whole data set in ways that reveal things and answer questions one was not asking. All of these questions call for well-adapted infrastructures that can efficiently organize data, evaluate and optimize queries, and execute algorithms that require significant computing and memory resources.
Big Data are enabling the next generation of interactive data analysis with real-time answers. In the future, queries over Big Data will be automatically generated for content creation on Web sites, to populate hot-lists or recommendations, and to provide ad hoc analysis of data sets to decide whether to keep or discard them. Scaling complex query processing techniques to terabytes while enabling interactive response times is a major open research problem today.
Analytical pipelines often involve multiple steps with built-in assumptions. By studying how best to capture, store, and query provenance, it is possible to create an infrastructure that supports interpreting analytical results and repeating an analysis with different assumptions, parameters, or data sets. Frequently, it is data exploration and visualization that allow Big Data to unleash its true impact. Visualization can help to produce and comprehend insights from Big Data. Tools such as Tableau, Vizify, D3.js, and R are simple and powerful ways of quickly discovering new things in increasingly large datasets.
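As a minimal, illustrative sketch of step-level provenance capture (not tied to any particular system; function and field names are assumptions), one can record, for each pipeline step, its parameters together with digests of its input and output:

```python
import hashlib
import json

provenance = []   # one entry per executed pipeline step

def run_step(name, func, data, **params):
    """Run one pipeline step and record what was done to which data."""
    result = func(data, **params)
    provenance.append({
        "step": name,
        "params": params,
        "input_digest": hashlib.sha1(json.dumps(data, sort_keys=True).encode()).hexdigest(),
        "output_digest": hashlib.sha1(json.dumps(result, sort_keys=True).encode()).hexdigest(),
    })
    return result

raw = [3, 1, 4, 1, 5, 9, 2, 6]
cleaned = run_step("filter_outliers",
                   lambda d, threshold: [x for x in d if x <= threshold],
                   raw, threshold=8)
summary = run_step("mean", lambda d: [sum(d) / len(d)], cleaned)

# The provenance log makes it possible to repeat the analysis with other
# parameters and to explain which assumptions produced a given result.
print(json.dumps(provenance, indent=2))
```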
Parallel Model for Implementing Data Processing Functions
A consensus on parallel and distributed database system architecture emerged in the 1990s. This architecture was based on a shared-nothing hardware design in which processors communicate with one another only by sending messages over a network. In such systems, the tuples of each relation in the database were partitioned (declustered) across disk storage units attached directly to each processor. Partitioning allowed multiple processors to scan large relations in parallel without the need for any exotic I/O devices. Such architectures were pioneered by Teradata in the late seventies and by several research projects; the design was used by Teradata, Tandem, NCR, Oracle-nCUBE, and several other products, and the research community adopted this shared-nothing dataflow architecture in systems like Arbre, Bubba, and Gamma. The shared-nothing design moves only questions and answers through the network: raw memory accesses and raw disk accesses are performed locally in a processor, and only the filtered (reduced) data are passed to the client program. This allows a more scalable design by minimizing traffic on the network. The main advantage of shared-nothing multi-processors is that they can be scaled up to hundreds and probably thousands of processors that do not interfere with one another. Twenty years later, Google's technical response to the challenges of Web-scale data management and analysis was the Google File System (GFS). To handle the challenge of processing the data in such large files, Google pioneered its Map-Reduce programming model and platform. This model enabled Google's developers to process large collections of data by writing two user-defined functions, map and reduce, that the Map-Reduce framework applies to the instances (map) and to sorted groups of instances that share a common key (reduce), similar to the partitioned parallelism used in shared-nothing parallel query processing.
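The programming model itself can be summarized by a small, purely in-memory sketch (this is not Google's or Hadoop's implementation): the developer supplies only the map and reduce functions, while the framework handles partitioning, shuffling, and grouping by key.

```python
from itertools import groupby
from operator import itemgetter

# User-defined functions: the only code an application developer writes.
def map_fn(document):
    """map: emit (key, value) pairs for one input record."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """reduce: combine all values that share the same key."""
    yield key, sum(values)

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase (would run in parallel on data partitions in a real framework).
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: sort and group the intermediate pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: each sorted group is handed to the user's reduce function.
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.extend(reduce_fn(key, (value for _, value in group)))
    return results

docs = ["big data needs big clusters", "data flows and data streams"]
print(run_mapreduce(docs, map_fn, reduce_fn))
```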
Yahoo!, Facebook, and other large Web companies followed. Taking Google's GFS and Map-Reduce papers as rough technical specifications, open-source equivalents were developed, and the Apache Hadoop Map-Reduce platform and its underlying file system, HDFS, emerged. Microsoft's technologies include a parallel runtime system called Dryad and two higher-level programming models, DryadLINQ and the SQL-like SCOPE language. The Hadoop community developed a set of higher-level declarative languages for writing queries and data analysis pipelines that are compiled into Map-Reduce jobs and then executed on the Hadoop Map-Reduce platform. Popular languages include Pig from Yahoo!, Jaql from IBM, and Hive from Facebook. Pig is relational-algebra-like in nature and is reportedly used for over 60% of Yahoo!'s Map-Reduce use cases; Hive is SQL-inspired and reportedly used for over 90% of Facebook's Map-Reduce use cases.
Recent works agree on the need to study the Map-Reduce model in order to identify its limitations and its pertinence for implementing data processing algorithms such as relational operators (e.g., join). Other dataflow-oriented platforms, such as Spark, propose alternatives for data processing that requires significant computing resources, while Storm and Flink address streams (i.e., Big Data velocity). New research opportunities are open in the database domain for studying different Map-Reduce models and proposing parallel programming strategies for data access that take into account the characteristics of different architectures, such as the cloud with its economic model and the QoS requirements of applications, as well as clusters, HPC, grids, and GPUs.
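For contrast with the hand-written Map-Reduce sketch above, the same word count expressed in Spark's dataflow style might look as follows (a minimal PySpark example; the session configuration is illustrative).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataflow-wordcount").getOrCreate()
sc = spark.sparkContext

docs = sc.parallelize(["big data needs big clusters",
                       "data flows and data streams"])

# The same word count as before, but expressed as a chain of dataflow
# transformations that the engine schedules across the cluster.
counts = (docs.flatMap(lambda line: line.lower().split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

print(counts.collect())
```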