There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset); deep learning techniques, such as dimension reduction, can be used to address these problems. Traditional data models and machine learning methods struggle with these issues because they cannot handle complex data at the scale of big data, which further supports the case for deep learning.
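As a small illustration of the dimension-reduction idea mentioned above, the sketch below uses PCA, a classical technique, as a stand-in for the deep-learning approaches (e.g., autoencoders); the synthetic data and component count are illustrative assumptions, not taken from the text.

```python
# Minimal sketch: compressing high-dimensional, noisy records into a compact
# representation. PCA is a classical stand-in for deep-learning dimension reduction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 5))            # hidden low-dimensional structure
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 50))   # 50 noisy observed features

pca = PCA(n_components=5)                      # keep 5 latent dimensions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                         # (1000, 5)
print(round(pca.explained_variance_ratio_.sum(), 3))   # most variance retained
```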
Using the 7Vs characteristics of big data, assign an issue you recognize in your industry to each characteristic and think of possible solutions. Additionally, note the positives and whether some of these key points could be applied elsewhere in your industry in a different manner.

3. Some Methods, Challenges and Technical Progress in Big Data Analytics for Disparate Data
A complete data value chain starts at data acquisition, followed by curation, storage, analysis, and usage. Activities and technical challenges along the data value chain are listed in Table 3, although this is not an all-inclusive list. Methods that are computationally efficient and able to handle heterogeneous data types (e.g., discrete counts and continuous values) and complex data structures (e.g., networks, trees, or domain-specific structures) are increasingly necessary. One major challenge is the integration of heterogeneous big data from different sources.
Table 3. Technical Challenges along the Data Value Chain
Data Acquisition | Data Curation | Data Storage | Data Analysis | Data Usage |
The capability of searching and navigating among different data forms can be improved by arranging data with different structures into a common schema. A system can relate structured, semi-structured, and unstructured data through an organizational template based on that common schema. Several types of heterogeneity can typically be grouped together, and similarity matching is a potential approach to reducing, totally or partially, the heterogeneity between data sources. Semi-structured data can also be transformed into a form with a predefined relational structure.
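As a brief illustration of similarity matching, the sketch below aligns field names from a hypothetical source onto an assumed common schema using simple string similarity; the schema, field names, and threshold are illustrative assumptions.

```python
# Minimal sketch of similarity matching: map source field names onto a common
# schema by string similarity (standard-library difflib only).
from difflib import SequenceMatcher

COMMON_SCHEMA = ["customer_id", "full_name", "purchase_date", "amount"]

def best_match(field, candidates, threshold=0.6):
    """Return the schema field most similar to `field`, or None if too dissimilar."""
    scored = [(SequenceMatcher(None, field.lower(), c).ratio(), c) for c in candidates]
    score, match = max(scored)
    return match if score >= threshold else None

source_fields = ["customer_identifier", "fullname", "purchase_dt", "amount_total", "notes"]
mapping = {f: best_match(f, COMMON_SCHEMA) for f in source_fields}
print(mapping)
# 'notes' has no close counterpart and maps to None; the rest align with the schema.
```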
Two processes are necessary to integrate data from multiple heterogeneous sources: data migration and data integration. Data migration is the retrieval and collection of data from their sources and the storage of that data, in a specified format, within a third data source. The data is then formatted and shared as feed data in a format such as Really Simple Syndication (RSS) or Resource Description Framework (RDF), which are typically serialized as XML or JSON. This collected data then needs to be integrated into a database (DB) by converting it into a format suitable for that DB. Data integration often includes two procedures: the first determines whether a record already exists in the DB and, if so, updates it; the second eliminates or combines the duplicates found in the heterogeneous data. If the data size reaches the tera- or petabyte scale, Hadoop systems can help store and handle the big data.
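The two integration procedures described above can be sketched as an upsert plus duplicate collapsing. The sketch below uses an in-memory SQLite table as a stand-in for the target DB; the table, columns, and feed records are assumptions for illustration.

```python
# Hedged sketch of the two integration procedures: existence check + update,
# and elimination/combination of duplicates (requires SQLite 3.24+ for UPSERT).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id TEXT PRIMARY KEY, title TEXT, source TEXT)")

feed_records = [                                   # e.g., parsed from RSS/RDF feeds
    {"item_id": "a1", "title": "Sensor reading 1", "source": "rss"},
    {"item_id": "a1", "title": "Sensor reading 1", "source": "rdf"},   # duplicate id
    {"item_id": "b2", "title": "Sensor reading 2", "source": "rss"},
]

for rec in feed_records:
    # Procedure 1: if the record exists, update it; otherwise insert it.
    # Procedure 2: duplicates collapse onto the same primary key.
    conn.execute(
        "INSERT INTO items (item_id, title, source) VALUES (?, ?, ?) "
        "ON CONFLICT(item_id) DO UPDATE SET title=excluded.title, source=excluded.source",
        (rec["item_id"], rec["title"], rec["source"]),
    )

print(conn.execute("SELECT * FROM items").fetchall())   # two rows, duplicates merged
```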
Higher-level data analytics can be conducted either within a database or within an in-memory database. In-database processing includes analytical functions such as statistical analysis, text mining, data mining, and online analytical processing (OLAP). In-memory capabilities include high-speed query processing, OLAP, and results caching. Lower-level data processing can be conducted to support data ingestion, analytical processing, or other functions such as data cleaning and discovery. S4 (Simple Scalable Streaming System) is a distributed, general-purpose platform used to develop applications for processing stream data. Storm is an open-source framework for distributed, robust, real-time computation on stream data.
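The following is not the Storm or S4 API; it is only a minimal Python sketch of the underlying idea of real-time stream computation, in which events are aggregated over windows as they arrive rather than stored first and queried later. The event format and window length are assumptions.

```python
# Toy stream processing: count events per type over tumbling time windows.
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=10):
    """Yield (window_start, counts-by-type) for each closed window (illustrative logic)."""
    counts, window_start = defaultdict(int), None
    for timestamp, event_type in events:           # events: (epoch_seconds, type)
        if window_start is None:
            window_start = timestamp
        if timestamp - window_start >= window_seconds:
            yield window_start, dict(counts)       # emit the closed window
            counts, window_start = defaultdict(int), timestamp
        counts[event_type] += 1
    if counts:
        yield window_start, dict(counts)           # flush the final partial window

stream = [(0, "click"), (3, "view"), (7, "click"), (12, "view"), (15, "click")]
for start, summary in tumbling_window_counts(stream):
    print(start, summary)
```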
There are some mature approaches and tools in natural language processing (NLP) that can be used to handle unstructured data. The SmartWeb Ontology-Based Annotation (SOBA) system, a system for ontology-based information extraction from heterogeneous sources (including tables, plain text, and image captions), has been designed, implemented, and evaluated. SOBA can process structured and unstructured data to extract information and integrate it into a coherent knowledge base. SOBA interlinks the information extracted from heterogeneous sources to create coherence and identifies duplicates.
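The sketch below is not SOBA; it is only a toy illustration of the general pattern described here: extract candidate facts from heterogeneous snippets (here with regular expressions rather than full NLP), normalize them, and merge duplicates into a single knowledge base. The snippets, patterns, and fact format are invented for illustration.

```python
# Toy information extraction and deduplication over heterogeneous text snippets.
import re

snippets = [                                        # hypothetical text and caption sources
    "Final score: Germany 3 - 1 Brazil",            # plain text
    "Image caption: Germany beat Brazil 3-1",       # image caption
]

score_pattern = re.compile(
    r"(\w+)\s+(?:beat\s+)?(\d)\s*-\s*(\d)\s+(\w+)|(\w+)\s+beat\s+(\w+)\s+(\d)-(\d)"
)

knowledge_base = set()
for text in snippets:
    m = score_pattern.search(text)
    if not m:
        continue
    g = m.groups()
    # Normalized fact: (team_a, team_b, goals_a, goals_b), whichever pattern matched.
    fact = (g[0], g[3], g[1], g[2]) if g[0] else (g[4], g[5], g[6], g[7])
    knowledge_base.add(fact)                        # duplicates collapse in the set

print(knowledge_base)                               # {('Germany', 'Brazil', '3', '1')}
```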
One of the important steps in the analytics of heterogeneous mixture data is breaking up the inherent heterogeneous mixture properties by partitioning the data into groups, where each group follows the same rules or patterns. There is a large number of possible groupings; therefore, it is difficult to verify each candidate. Three important issues related to grouping the data are as follows: 1) the number of groups; 2) the method of grouping; and 3) a suitable choice of prediction models based on the features of an individual group.
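A minimal sketch of this grouping idea follows, assuming k-means as the grouping method and a linear model per group; the number of groups, grouping method, and per-group model correspond to the three choices listed above, and all concrete values are illustrative.

```python
# Hedged sketch: partition the data into groups, then fit one model per group.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))                        # synthetic heterogeneous records
# The target follows different rules in different regions of the feature space.
y = np.where(X[:, 0] > 0, 2.0, -3.0) * X[:, 1] + rng.normal(scale=0.1, size=300)

# Choices 1 and 2: two groups, found by k-means (an assumed grouping method).
groups = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Choice 3: a separate prediction model fitted to each group.
models = {}
for g in np.unique(groups):
    mask = groups == g
    models[g] = LinearRegression().fit(X[mask], y[mask])

for g, model in models.items():
    print(g, np.round(model.coef_, 2))               # per-group coefficients
```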
General approaches to data integration fit data into a common, predetermined schema or data model. The data lake is a relatively new method that relaxes this standardization, which results in a higher potential for data discovery and operational insight. Data lakes help address problems of data integration and data accessibility, and the data lake is also an emerging method for cloud-based big data. The features of data lakes include the following:
- Size and low cost: Data lakes are large and can be an order of magnitude less expensive than a traditional data warehouse.
- Ease of accessibility: This is a benefit of keeping the data in its original form.
- Fidelity: Hadoop data lakes keep data in its original form and capture data changes and contextual semantics throughout its lifecycle.
Compared with data warehouse systems, which have a relational view of data, data lakes handle more heterogeneous data sources, such as semi-structured and unstructured sources. A data lake system, Constance, has been developed that offers advanced metadata management over raw data extracted from heterogeneous sources. Regardless of the format of the source data (e.g., spreadsheets, relational data, XML, or JSON), Constance loads and stores the data in its original format, without the costly transformation procedures of the traditional ETL (Extract, Transform, Load) process. Data lakes have been conceptualized as repositories for big data; such repositories can store raw data and provide functionality for on-demand integration.
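The load-in-original-format idea can be sketched as follows. This is not the Constance API; the paths, catalog layout, and metadata fields are assumptions chosen only to show raw ingestion with lightweight metadata and no upfront transformation.

```python
# Minimal data-lake-style ingestion: copy the file unchanged, record metadata.
import json
from pathlib import Path

LAKE_DIR = Path("lake/raw")
CATALOG = Path("lake/catalog.json")

def ingest(source_path: Path) -> dict:
    """Copy a file into the lake in its original format and catalog basic metadata."""
    LAKE_DIR.mkdir(parents=True, exist_ok=True)
    target = LAKE_DIR / source_path.name
    target.write_bytes(source_path.read_bytes())     # no transformation (no ETL step)
    entry = {
        "name": source_path.name,
        "format": source_path.suffix.lstrip("."),    # csv, json, xml, xlsx, ...
        "size_bytes": target.stat().st_size,
    }
    catalog = json.loads(CATALOG.read_text()) if CATALOG.exists() else []
    catalog.append(entry)
    CATALOG.write_text(json.dumps(catalog, indent=2))
    return entry

# Usage with a hypothetical raw export created on the fly:
sample = Path("exports/sales_2021.json")
sample.parent.mkdir(parents=True, exist_ok=True)
sample.write_text('[{"region": "EU", "amount": 1200}]')
print(ingest(sample))
```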
There has been only limited research on the representation of heterogeneous big data from multiple sources, energy-efficiency optimization of distributed storage, semantic comprehension methods, and processing hardware and software system architectures. Researchers should also conduct further study of big data security, including completeness maintenance, credibility, and backup and recovery.