This paper explores what a Database Management System (DBMS) suited to the future may look like, based on issues that can be seen today and on emerging trends, and how such a system may be created. An apt example is a system that allows efficient and continuous querying and mining of data flows and that can be deployed on media with different computing capacities. Which forms of human-to-machine communication and interoperability have proved most beneficial? Consider, for example, how an individual embedded medical device will be included in a DBMS as processes become more complex and storage facilities more distributed. Which key aspects of DBMS could benefit future architectures?
4. Big Data Analytics Systems
The emergence of Big Data some years ago denoted the challenge of dealing with huge collections of heterogeneous data that are continuously produced and must be exploited through data analytics processes. The first approaches addressed data volume and processing scalability challenges. Solutions can be described as balancing the delivery of physical services such as: (i) hardware (computing, storage and memory); (ii) communication (bandwidth and reliability) and scheduling; (iii) greedy analytics and mining processes with high in-memory and computing cycle requirements. The next sections describe different system approaches that provide solutions for dealing with Big Data: analytics stacks, distributed data persistence solutions, cloud data management services and parallel runtime environments.
Big Data Analytics Stacks
Due to their democratization, Big Data management and processing are no longer associated only with scientific applications that have prediction and analytics requirements. The requirements of artificial intelligence algorithms also call for Big Data aware management, related to the understanding and automatic control of complex systems and to decision making in critical and non-critical situations. Therefore, new data analytics stacks have emerged as environments that provide the necessary underlying infrastructure for giving access to data collections and implementing data processing workflows to transform them and execute data analytics operations (statistics, data mining, knowledge discovery, computational science processes) on top of them.
One of the most prominent is the Berkeley Data Analytics Stack (BDAS) from the AMPLab project at Berkeley. BDAS is a multi-layered architecture that provides tools for virtualizing resources, addressing storage, data processing and querying as underlying tools for Big Data aware applications. Another important Big Data stack system is AsterixDB from the Asterix project. AsterixDB is a scalable, open source Big Data Management System (BDMS).
Data lake environments also deal with Big Data management and analytics through integrated environments designed as toolkits. A data lake is a shared data environment consisting of multiple repositories. It provides data to an organization for a variety of analytics processing, including discovery and exploration of data, simple ad hoc analytics, complex analysis for business decisions, reporting, and real-time analytics. Industrial solutions are on the market today, such as Microsoft Azure Data Lake, IBM, and Teradata.
Distributed Data Persistence Solutions
Data reads and writes in many data analytics workflows are guided by the RUM conjecture (Read, Update, Memory (or storage) overhead), which characterizes the challenge of reducing the overhead of data being read, updated and stored (in memory, cache or disk). Several platforms address some aspect of the problem, such as Big Data stacks; data processing environments (e.g., Hadoop, Spark, CaffeOnSpark); data stores dealing with the CAP (consistency, availability and partition tolerance) theorem (e.g., NoSQL stores); and distributed file systems (e.g., HDFS). The principle is to define APIs (application programming interfaces) that programs use to interact with distributed data storage layers that can cope with distributed and parallel architectures.
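The trade-off behind the RUM conjecture can be illustrated with a minimal sketch. It assumes two toy in-memory structures, AppendLog and SortedTable, which are hypothetical names introduced here and do not correspond to any of the platforms cited above. The log favours updates at the price of slow reads and extra memory for stale versions, while the sorted table favours reads at the price of more expensive updates.

```python
import bisect


class AppendLog:
    """Update-optimized (hypothetical): writes append, reads scan backwards."""

    def __init__(self):
        self.entries = []  # stale versions are kept: memory overhead

    def put(self, key, value):
        self.entries.append((key, value))  # constant-time update

    def get(self, key):
        """Return (value, number of entries inspected)."""
        steps = 0
        for k, v in reversed(self.entries):
            steps += 1
            if k == key:
                return v, steps
        return None, steps


class SortedTable:
    """Read-optimized (hypothetical): binary search reads, shifting writes."""

    def __init__(self):
        self.keys = []
        self.values = []

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value  # overwrite in place, no stale copy
        else:
            self.keys.insert(i, key)  # shifts later elements: costly update
            self.values.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None


log, table = AppendLog(), SortedTable()
for k in range(100):
    log.put(k, k)
    table.put(k, k)
# Overwrite one key in both structures.
log.put(0, -1)
table.put(0, -1)
```

After the overwrite, the log holds one more entry than the table (the stale version of key 0), and reading an old key from the log requires scanning most of its entries, whereas the table answers with a binary search.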
In the distributed systems domain, object persistence is an important issue that has already been addressed by consolidated middleware such as JBOSS and PiJAMA. The new exascale requirements introduced by greedy processes, often related to Big Data processing, have brought object persistence back to the fore. In order for exascale and/or Big Data systems to deliver the needed I/O performance, new storage devices such as NVRAM or Storage Class Memories (SCM) need to be included in the storage/memory hierarchy. Given that these new devices will be closer in nature to memory than to storage (low latencies, high bandwidth, and a byte-addressable interface), using them as block devices for a file system does not seem to be the best option. DataClay proposes object storage to enable both the programmer and DataClay itself to take full advantage of the coming high-performance and byte-addressable storage devices. Today, given the lack of such devices, DataClay maps such abstractions to key-value stores such as Kinetic drives from Seagate.
Data structures and their associated functions are sometimes more important for an application than non-functional properties like RUM or CAP. Non-relational databases have emerged as solutions for dealing with huge data sets and massive query workloads. These systems have been redesigned to achieve scalability and availability at the cost of providing only a reduced set of low-level data management functions, thus forcing the client application to take care of complex logic. Existing approaches such as Hecuba and Model2Roo provide tools and interfaces to ensure an efficient and global interaction with non-relational technologies.
The large spectrum of data persistence and management solutions is adapted to workloads associated with Big Data volumes, involving either simple read and write operations or more complex data processing tasks. The challenge today is choosing the right combination of data management tools for variable application requirements and architecture characteristics. From our point of view, plasticity is the most important property of such a combination of tools.
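The general idea of mapping persistent objects onto a key-value store can be sketched as follows. This is only an illustration under stated assumptions: it does not use or reproduce the DataClay API, and KeyValueStore, PersistentObject, Patient, make_persistent and load are hypothetical names, with an in-memory dictionary standing in for the key-value backend.

```python
import json
import uuid


class KeyValueStore:
    """In-memory stand-in for a key-value backend (hypothetical)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


class PersistentObject:
    """Base class whose instances can be flattened into one record."""

    def to_record(self):
        # Store the class name so the object can be rebuilt later.
        return {"class": type(self).__name__, "fields": dict(vars(self))}


class Patient(PersistentObject):
    """Example application class (hypothetical)."""

    def __init__(self, name, heart_rate):
        self.name = name
        self.heart_rate = heart_rate


def make_persistent(store, obj):
    """Serialize the object and store it under a fresh identifier."""
    object_id = str(uuid.uuid4())
    store.put(object_id, json.dumps(obj.to_record()))
    return object_id


def load(store, object_id, registry):
    """Rebuild an object from its record, using a name-to-class registry."""
    record = json.loads(store.get(object_id))
    cls = registry[record["class"]]
    obj = cls.__new__(cls)  # bypass __init__, fields come from the record
    obj.__dict__.update(record["fields"])
    return obj


store = KeyValueStore()
oid = make_persistent(store, Patient("p-001", 72))
restored = load(store, oid, {"Patient": Patient})
```

The application manipulates objects only, while the storage layer sees opaque keys and values. This sketch handles only JSON-serializable fields; references between objects would need their own encoding.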
Cloud Data Management Services
Cloud computing is emerging as a relatively new approach for facilitating unlimited access to computing and storage resources for building applications. The underlying infrastructure manages such resources transparently, without requiring the application to include code for managing them or to reserve more resources than those really required. The difference with classic approaches is that the application can have an ad hoc execution context, and that the resources it consumes are not necessarily located in one machine. Following the same approach, database management system functions can be delivered as services that must be tuned and composed for efficiently and inexpensively managing, querying and exploiting huge data sets.
Cloud architectures provide services at different scales and add constraints for accessing data, for instance access control, resource reservation, and assignment using priorities (e.g., in grid architectures) and economic cost (e.g., in the cloud). Applications deployed in these architectures specify QoS preferences (SLA contracts) that include execution and processing time, data pertinence and provenance, economic cost, and data processing energy consumption cost.
Thus, data management must be revisited in order to design strategies that couple the characteristics of novel architectures with users' preferences. In this context we identify three key scientific challenges: (i) data (flows) access and processing guided by SLA contracts, where data are produced by services and devices connected on heterogeneous networks; (ii) estimation and reduction of the temporal, economic and energy consumption costs of accessing and processing data; (iii) optimization of data processing guided by SLA contracts expressed using cost models as reference.
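As a rough illustration of processing guided by SLA contracts and cost models, the sketch below selects an execution plan that respects SLA bounds on time, economic cost and energy, and then minimizes a weighted sum of the three. The Plan and SLA classes, the plan names and all numeric values are assumptions made for the example; the normalized weighted sum is one simple cost model among many and is not taken from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Estimated costs of one candidate execution plan (hypothetical)."""
    name: str
    time_s: float
    cost_usd: float
    energy_j: float


@dataclass(frozen=True)
class SLA:
    """Upper bounds plus weights expressing the user's preferences."""
    max_time_s: float
    max_cost_usd: float
    max_energy_j: float
    weights: tuple = (1.0, 1.0, 1.0)  # (time, economic cost, energy)


def satisfies(plan, sla):
    """A plan is feasible only if it respects every bound of the SLA."""
    return (plan.time_s <= sla.max_time_s
            and plan.cost_usd <= sla.max_cost_usd
            and plan.energy_j <= sla.max_energy_j)


def weighted_cost(plan, sla):
    """Normalize each dimension by its bound, then combine with weights."""
    w_time, w_cost, w_energy = sla.weights
    return (w_time * plan.time_s / sla.max_time_s
            + w_cost * plan.cost_usd / sla.max_cost_usd
            + w_energy * plan.energy_j / sla.max_energy_j)


def choose_plan(plans, sla):
    """Return the cheapest feasible plan, or None if the SLA cannot be met."""
    feasible = [p for p in plans if satisfies(p, sla)]
    if not feasible:
        return None
    return min(feasible, key=lambda p: weighted_cost(p, sla))


plans = [
    Plan("local", time_s=120, cost_usd=0.10, energy_j=900),
    Plan("edge", time_s=55, cost_usd=0.20, energy_j=800),
    Plan("cloud_small", time_s=40, cost_usd=0.50, energy_j=400),
    Plan("cloud_large", time_s=10, cost_usd=2.50, energy_j=600),
]
balanced = SLA(max_time_s=60, max_cost_usd=1.0, max_energy_j=1000)
frugal = SLA(max_time_s=60, max_cost_usd=1.0, max_energy_j=1000,
             weights=(0.2, 3.0, 0.2))
```

With equal weights the cloud_small plan is selected, while a contract that weights economic cost heavily shifts the choice to the edge plan, even though both contracts impose the same bounds.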
Parallel Runtime Environments
Today, perhaps because of the emergence of Big Data and of greedy algorithms and applications requiring computing resources, parallel architectures have returned to the arena. There are different kinds of providers of computing, memory and storage resources, each adopting its own method for delivering such resources for executing programs. There are three categories of resource provision: (i) Platform-as-a-Service (PaaS) frameworks, (ii) programming models for computing intensive workloads and (iii) programming models for Big Data.
PaaS frameworks offer APIs for writing applications. For example, in the Microsoft Azure Cloud programming model, applications are structured according to roles, which use APIs to communicate (queues) and to access persistent storage (blobs and tables). Microsoft Generic Worker proposes a mechanism for developing a Worker Role that eases the porting of legacy code to the Azure platform. Google App Engine provides libraries to invoke external services and to queue units of work (tasks) for execution; furthermore, it allows running applications programmed in the Map-Reduce model. Data transfer and synchronization are handled automatically by the runtime.
Environments for computing intensive applications generally use the bag of tasks execution model, which conceives an application as composed of independent parallel tasks. Examples include Cloud BigJob, Amazon EC2, Eucalyptus and Nimbus Clouds, and ProActive, which offers a resource manager developed to mix Cloud and Grid resources.
Map-Reduce is perhaps the most prominent programming model for data intensive applications. Map-Reduce-based runtime environments provide good performance on cloud architectures, above all for data analytics tasks working on large data collections. Microsoft Daytona proposes an iterative Map-Reduce runtime for Windows Azure to support data analytics and machine learning algorithms. Twister is an enhanced Map-Reduce runtime with an extended programming model for iterative Map-Reduce computations. Hadoop is the most popular open source implementation of Map-Reduce, running on top of HDFS, the distributed file system mentioned earlier. Using Hadoop avoids lock-in to a specific platform, since the same Map-Reduce application can be executed on any Hadoop compliant service, such as Amazon Elastic Map-Reduce.
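The Map-Reduce programming model can be sketched in a few lines of plain Python. This is a single-process illustration of the model only, not of Hadoop, Twister or Daytona: the function names are assumptions, and a real runtime would distribute the map and reduce tasks across machines and handle data transfer and fault tolerance.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    """Run the three phases sequentially over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = map_reduce(["big data systems", "data systems scale", "big data"])
# counts == {"big": 2, "data": 3, "systems": 2, "scale": 1}
```

Because each call to map_phase depends on a single document and each call to reduce_phase on a single key, both phases can be parallelized independently, which is the property that Map-Reduce runtimes exploit.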