3. Some Methods, Challenges and Technical Progress in Big Data Analytics for Disparate Data

A complete data value chain starts with data acquisition, followed by data curation, storage, analysis, and usage. Activities and technical challenges along the data value chain are listed in Table 3, although this is not an all-inclusive list. Methods that are computationally efficient and able to handle heterogeneous data types (e.g., discrete counts, continuous values) and complex data structures (e.g., networks, trees, or domain-specific structures) are increasingly necessary. One major challenge is the integration of heterogeneous big data from different sources.

Table 3. Technical Challenges along the Data Value Chain

Data Acquisition
  • Sensor networks
  • Stream data
  • Unstructured data
  • Protocols
  • Data variety

Data Curation
  • Annotation
  • Data quality
  • Interoperability
  • Data validation

Data Storage
  • In-memory DBs
  • NoSQL DBs
  • Cloud storage
  • Security and privacy

Data Analysis
  • Stream mining
  • Semantic analysis
  • Information extraction
  • Cross-sectional analysis

Data Usage
  • Visualization
  • In-use analytics
  • Prediction
  • Decision support

The capability to search and navigate among different data forms can be improved by arranging data with different structures into a common schema. A system can relate structured, semi-structured, and unstructured data through an organizational template based on the common schema. Several types of heterogeneity can typically be grouped together, and similarity matching is a potential approach to reducing, totally or partially, the heterogeneity between data sources. Semi-structured data can also be transformed into a type with a predefined relational structure.
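
As a concrete illustration, the sketch below matches attribute names from two source schemas by string similarity, one simple form of similarity matching. The field names and the similarity threshold are assumptions chosen for the example, not a prescribed configuration.

```python
# Minimal sketch of attribute-level similarity matching between two source
# schemas. Field names and the 0.5 threshold are illustrative assumptions.
from difflib import SequenceMatcher

def match_attributes(schema_a, schema_b, threshold=0.5):
    """Pair attributes from two schemas whose names are sufficiently similar."""
    matches = []
    for a in schema_a:
        # Find the attribute in schema_b with the highest name similarity.
        best, score = None, 0.0
        for b in schema_b:
            s = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if s > score:
                best, score = b, s
        if score >= threshold:
            matches.append((a, best, round(score, 2)))
    return matches

# Example: relating relational columns to fields parsed from a JSON feed.
relational_cols = ["customer_id", "full_name", "email_address"]
json_fields = ["custId", "fullName", "email", "signup_date"]
print(match_attributes(relational_cols, json_fields))
```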

Two processes are necessary to integrate data from multiple heterogeneous sources: data migration and data integration. Data migration refers to retrieving and collecting data from their sources and storing them in a specified format within a third data source. The data are then packaged and shared as feeds in formats such as Really Simple Syndication (RSS) and/or the Resource Description Framework (RDF), which are serialized as JSON or XML. The collected data must then be integrated into a database (DB) by converting them into a format suitable for that DB. Data integration often involves two procedures: the first determines whether the data already exist in the DB and updates them accordingly; the second eliminates or combines the duplicates found in the heterogeneous data. If the data volume reaches terabytes or petabytes, Hadoop systems can help store and handle the big data.
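
A minimal sketch of these two integration procedures is given below, assuming an in-memory SQLite table as a stand-in for the target DB; the table name, column names, and key/payload layout of the feed items are illustrative assumptions.

```python
# Hedged sketch of the two integration procedures: (1) insert-or-update and
# (2) duplicate elimination, using SQLite as a stand-in for the target DB.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (source_key TEXT PRIMARY KEY, payload TEXT)")

def integrate(feed_items):
    """Procedure 1: insert new items or update items already in the DB."""
    for key, payload in feed_items:
        exists = conn.execute(
            "SELECT 1 FROM records WHERE source_key = ?", (key,)
        ).fetchone()
        if exists:
            conn.execute("UPDATE records SET payload = ? WHERE source_key = ?",
                         (payload, key))
        else:
            conn.execute("INSERT INTO records VALUES (?, ?)", (key, payload))
    conn.commit()

def deduplicate(feed_items):
    """Procedure 2: eliminate or combine duplicates across heterogeneous feeds."""
    seen = {}
    for key, payload in feed_items:
        # Keep the longer payload when two sources report the same key.
        if key not in seen or len(payload) > len(seen[key]):
            seen[key] = payload
    return list(seen.items())

# Feed items collected from RSS/RDF sources (migration step), then integrated.
feed = [("item-1", "short"), ("item-1", "longer description"), ("item-2", "x")]
integrate(deduplicate(feed))
```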

Higher-level data analytics can be conducted either within a database or within an in-memory database. In-database processing includes analytical functions such as statistical analysis, text mining, data mining, and online analytical processing (OLAP). In-memory capabilities include high-speed query processing, OLAP, and results caching. Lower-level data processing can be conducted to support data ingestion, analytical processing, or other functions such as data cleaning and discovery. S4 (Simple Scalable Streaming System) is a distributed, general-purpose platform used to develop applications for processing stream data. Storm is an open-source framework for distributed, robust, real-time computation on stream data.
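
To make the stream-mining idea concrete, the following plain-Python sketch maintains event counts over a sliding window of recent items. It illustrates the concept only; it does not use the S4 or Storm APIs, which are distributed frameworks with their own programming models.

```python
# Generic illustration of real-time stream processing: incremental counts
# over a sliding window of the most recent events.
from collections import Counter, deque

class WindowedCount:
    def __init__(self, window_size=100):
        self.window = deque(maxlen=window_size)  # keeps only the last N events
        self.counts = Counter()

    def process(self, event):
        """Update counts incrementally as each event arrives."""
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1     # expire the oldest event
        self.window.append(event)
        self.counts[event] += 1
        return self.counts.most_common(3)        # current top-3 events

stream = ["error", "ok", "ok", "error", "timeout", "ok"]
processor = WindowedCount(window_size=4)
for token in stream:
    top = processor.process(token)
```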

There are some mature approaches and tools in natural language processing (NLP) that can be used to handle unstructured data. The SmartWeb Ontology-Based Annotation (SOBA) system, a system for ontology-based information extraction from heterogeneous sources (including tables, plain text, and image captions), has been designed, implemented, and evaluated. SOBA can process structured and unstructured data to extract information and integrate it into a coherent knowledge base. SOBA interlinks the information extracted from heterogeneous sources to create coherence and identifies duplicates.
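
A toy sketch of this kind of extraction and duplicate detection is shown below; the regular-expression rule and the dictionary "knowledge base" are simplifying assumptions and do not reproduce the actual SOBA system.

```python
# Toy illustration of information extraction from heterogeneous sources with
# duplicate detection and interlinking of mentions in a simple knowledge base.
import re

KB = {}  # canonical entity name -> set of (mention, source) pairs

def extract_entities(text):
    """Pull capitalized multi-word phrases as candidate named entities."""
    return re.findall(r"\b(?:[A-Z][a-z]+\s){1,2}[A-Z][a-z]+\b", text)

def normalize(name):
    """Crude canonical form used to spot duplicates across sources."""
    return " ".join(name.lower().split())

def integrate(mention, source):
    key = normalize(mention)
    KB.setdefault(key, set()).add((mention, source))  # interlink the sources

caption = "Bayern Munich beat Real Madrid in the semifinal."
table_cell = "BAYERN MUNICH"
for m in extract_entities(caption):
    integrate(m, "image caption")
integrate(table_cell, "table")   # structured cell merged with the caption mention
```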

One of the important steps in the analytics of heterogeneous mixture data is breaking up the inherent heterogeneous mixture properties by partitioning the data into groups, where each group follows the same rules or patterns. There are a large number of possible groupings; therefore, it is difficult to verify each candidate. Three important issues related to grouping the data are as follows: 1) the number of groups; 2) the method of grouping; and 3) a suitable choice of prediction models based on the features of each individual group.
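
The sketch below illustrates the grouping idea with hypothetical choices for all three issues: two groups, k-means as the grouping method, and a separate linear regression model per group. These choices are assumptions for the example, not the method prescribed by the text.

```python
# Hedged sketch: partition heterogeneous mixture data into groups and fit a
# separate prediction model per group.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                 # synthetic features with mixed behaviour
y = np.where(X[:, 0] > 0, 2 * X[:, 1], -3 * X[:, 2]) + rng.normal(0, 0.1, 300)

n_groups = 2                                             # issue 1: number of groups
grouping = KMeans(n_clusters=n_groups, n_init=10,
                  random_state=0).fit(X)                 # issue 2: grouping method
models = {}
for g in range(n_groups):                                # issue 3: model per group
    mask = grouping.labels_ == g
    models[g] = LinearRegression().fit(X[mask], y[mask])

# Prediction routes each new sample to its group's model.
x_new = rng.normal(size=(1, 3))
g_new = grouping.predict(x_new)[0]
y_hat = models[g_new].predict(x_new)
```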

General approaches to data integration place data into a common, predetermined schema or data model. The data lake is a relatively new approach that relaxes this standardization, resulting in a higher potential for data discovery and operational insight. Data lakes help address problems of data integration and data accessibility, and the data lake is also an emerging approach for cloud-based big data. The features of data lakes include the following:

  • Size and low cost: Data lakes are big and can be an order of magnitude less expensive than traditional data warehouse storage.
  • Ease of accessibility: This is a benefit of keeping the data in its original form.
  • Fidelity: Hadoop data lakes keep data in its original form and capture data changes and contextual semantics throughout the data lifecycle.

Compared with data warehouse systems, which have a relational view of data, data lakes handle more heterogeneous data sources, such as semi-structured and unstructured sources. A data lake system, Constance, has been developed that offers advanced metadata management over raw data extracted from heterogeneous sources. Regardless of the format of the source data (e.g., spreadsheets, relational data, XML, or JSON), Constance loads and stores the data in its original format, avoiding the costly transformation procedures of a traditional ETL (Extract, Transform, Load) process. Data lakes have been conceptualized as repositories for big data; such repositories can store raw data and provide functionality for on-demand integration.
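
The sketch below illustrates data-lake-style ingestion under simple assumptions (a local directory and ad hoc metadata fields, not the Constance design): the raw file is stored unchanged, and a small metadata record is kept alongside it to support later on-demand integration.

```python
# Minimal sketch of data-lake ingestion: copy the raw file into the lake
# without transformation and record lightweight metadata for later discovery.
import json
import shutil
import time
from pathlib import Path

LAKE = Path("lake/raw")   # illustrative location for the raw zone

def ingest(source_file, source_system):
    LAKE.mkdir(parents=True, exist_ok=True)
    src = Path(source_file)
    target = LAKE / src.name
    shutil.copy2(src, target)                 # keep the data in its original form
    metadata = {                              # captured for on-demand integration
        "source_system": source_system,
        "original_name": src.name,
        "format": src.suffix.lstrip(".") or "unknown",
        "ingested_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "size_bytes": target.stat().st_size,
    }
    with open(target.with_suffix(target.suffix + ".meta.json"), "w") as f:
        json.dump(metadata, f, indent=2)
    return target

# Spreadsheets, XML, or JSON files would be ingested as-is (schema-on-read later):
# ingest("exports/customers.xlsx", source_system="CRM")
```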

There has been only limited research on the representation of heterogeneous big data from multiple sources, energy-efficiency optimization of distributed storage, semantic comprehension methods, and the associated hardware and software system architectures for processing. Researchers should also conduct further study of big data security, including completeness maintenance, credibility, and backup and recovery.