1. Introduction

Many data sets are heterogeneous in type, structure, organization, granularity, semantics, and accessibility. This diversity of data sources often leads to data silos: collections of non-integrated data management systems with heterogeneous schemas, APIs, and query languages. Data types from heterogeneous sources therefore often have to be unified during pre-processing. To scale to many different sources, holistic data integration methods should be automatic or require only minimal manual interaction. Data integration is the process of combining multiple local sources without copying their data into a central warehouse; it ensures interoperation among the sources and access to up-to-date data. Heterogeneous data sources must be harmonized into a single data framework before they can be consolidated and integrated. Efforts are therefore required to develop systems that map different standards to a common format or establish semantic interoperability between them.

Disparate data is heterogeneous data collected from any number of sources; the sources may be known or unknown and span various formats. Disparate data contains substantial noise and many inaccurate records, making it necessary to filter the noise and remove those records. Big data is often identified as disparate data when its sources are heterogeneous. Big data has been defined as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze". It often ranges from a few dozen terabytes (TB: approximately 10^12 bytes) to multiple petabytes (PB: approximately 10^15 bytes). Big data is often high-dimensional and poorly structured or organized because it is typically generated from heterogeneous sources. It can be structured (e.g. spreadsheets, relational databases), unstructured (e.g. text, images), or semi-structured, such as radio frequency identification (RFID) data and extensible markup language (XML) data. Big data is often selective, incomplete, and erroneous. Its characteristics can be categorized into "7 Vs" as follows: 
  • Volume: massive amounts of data. 
  • Variety: heterogeneity of data types, representation, and semantic interpretation. 
  • Velocity: data is generated at a rate exceeding what traditional systems can handle. 
  • Variability: data changes (dynamic) during processing and the lifecycle. 
  • Veracity: accuracy, truthfulness, and reliability. 
  • Valence: connectedness; two data items are connected when they are related to each other. 
  • Value: added value brought from the collected data.
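As a small illustration of Variety, the same logical record can arrive in structured, semi-structured, and unstructured forms, each requiring its own parsing strategy before the data can be unified. The following Python sketch (the sensor-reading schema and all field names are hypothetical, chosen only for illustration) normalizes the three forms into one common record:

```python
import csv
import io
import json
import re

def from_csv(text):
    # Structured: fixed schema, one record per row.
    row = next(csv.DictReader(io.StringIO(text)))
    return {"id": row["id"], "temp_c": float(row["temp_c"])}

def from_json(text):
    # Semi-structured: self-describing, possibly nested.
    doc = json.loads(text)
    return {"id": doc["sensor"]["id"], "temp_c": float(doc["reading"])}

def from_free_text(text):
    # Unstructured: the schema must be recovered by pattern matching.
    m = re.search(r"sensor (\S+) reported ([\d.]+) C", text)
    return {"id": m.group(1), "temp_c": float(m.group(2))}

records = [
    from_csv("id,temp_c\ns1,21.5"),
    from_json('{"sensor": {"id": "s1"}, "reading": "21.5"}'),
    from_free_text("sensor s1 reported 21.5 C"),
]
assert all(r == {"id": "s1", "temp_c": 21.5} for r in records)
```

The per-format parsers converge on one target schema, which is the essence of the unification step discussed above: downstream analytics only ever sees the common record shape.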

Traditional data models struggle to handle complex data within the framework of big data, and no data model has yet been acknowledged as effective and efficient for it. Table 1 describes big data in various aspects. The Internet of Things (IoT) consists of items that are identifiable as part of the Internet; connected to each other over the Internet, they enable better processes and better services. Transformation is the process of converting data into an appropriate format for analysis. In a column-oriented store, data is stored by column, and attribute values belonging to the same column are stored contiguously. A document-oriented store supports complex data forms such as JSON, XML, and binary documents. A key-value store supports storage and retrieval of very large volumes of data through simple keys. A graph database uses graph models in which nodes, edges, and properties are related to each other through relations. Big Data analytics is powerful in discovering correlations in the data that are not otherwise apparent.

Table 1. Big Data in Different Aspects

Data Formats       Data Sources    Data Processing   Data Staging     Data Stores
Structured         Transactions    Batch             Normalization    Column-oriented
Semi-structured    Web & Social    Real time         Cleansing        Document-oriented
Unstructured       Sensing                           Transform        Key-value
                   Machine                                            Graph based
                   IoT
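The four store models in Table 1 can be contrasted by representing one record under each. The Python sketch below uses hypothetical data (the user record and all keys are illustrative, not from any particular database product) to show the structural differences:

```python
# One record represented under the four store models of Table 1.
# All keys and values are hypothetical, for illustration only.

# Key-value: an opaque value looked up by a single key.
kv_store = {"user:alice": '{"city": "Oslo", "age": 30}'}

# Document-oriented: the value is a queryable document (JSON-like).
doc_store = {"users": [{"_id": "alice", "city": "Oslo", "age": 30}]}

# Column-oriented: values of the same attribute stored contiguously.
col_store = {"user": ["alice"], "city": ["Oslo"], "age": [30]}

# Graph-based: nodes plus labeled edges expressing relations.
graph_store = {
    "nodes": {"alice": {"age": 30}, "Oslo": {}},
    "edges": [("alice", "LIVES_IN", "Oslo")],
}

# Column orientation makes per-attribute scans cheap, since one
# attribute's values sit together:
avg_age = sum(col_store["age"]) / len(col_store["age"])
assert avg_age == 30
```

The choice among these layouts trades query flexibility (document, graph) against scan and lookup efficiency (column, key-value), which is why heterogeneous big data systems often combine several of them.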

This paper focuses on the Variety (various data types and formats) and Veracity (data quality) of big data, because these are two key issues of disparate data. The paper is organized as follows: Section 2 introduces disparate data and big data, including data quality problems and data variety in disparate data, handling missing data, removing duplicates and redundancy, and some ideas and strategic focuses of Big Data analytics for disparate data; Section 3 presents methods, challenges, and technical progress in Big Data analytics for disparate data; Section 4 discusses the limitations of traditional data mining and machine learning in Big Data analytics, the strength of deep learning in handling the variety and volume of big data, and its challenges in Big Data analytics for disparate data; the final section concludes the paper.
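Two of the Veracity-related staging steps mentioned above, handling missing data and removing duplicates, can be sketched minimally in Python (the record schema and mean-imputation strategy are illustrative assumptions, not the paper's prescribed method):

```python
# Minimal data-staging sketch: impute a missing numeric field with the
# column mean, then drop exact duplicate records. Field names are
# hypothetical; mean imputation is just one of many strategies.
raw = [
    {"id": 1, "temp": 20.0},
    {"id": 2, "temp": None},   # missing value
    {"id": 1, "temp": 20.0},   # exact duplicate of the first record
]

# Mean imputation over the observed values.
observed = [r["temp"] for r in raw if r["temp"] is not None]
mean_temp = sum(observed) / len(observed)
cleaned = [dict(r, temp=r["temp"] if r["temp"] is not None else mean_temp)
           for r in raw]

# Duplicate removal, preserving the first occurrence.
seen, deduped = set(), []
for r in cleaned:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(r)

assert len(deduped) == 2
assert deduped[1]["temp"] == 20.0
```

Sections 2 and 3 treat these quality problems in depth; the point here is only that both steps are mechanical once a common record schema exists.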