There are some common issues when dealing with big data. Two critical ones are data quality and data variety (such as multiple formats within the same dataset). Deep learning techniques, such as dimension reduction, can be used to mitigate these problems. Traditional data models and machine learning methods struggle with these data issues, which further supports the case for deep learning, since the former cannot handle complex data within the framework of big data.
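As an illustration of dimension reduction, a sketch (not from the text, and assuming NumPy is available) of principal component analysis via the singular value decomposition, which compresses many redundant features into a few components:

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project rows of X onto the top principal components."""
    X_centered = X - X.mean(axis=0)          # center each feature
    # SVD of the centered data; rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T  # scores in the reduced space

# Toy example: 100 synthetic records with 50 features each
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
Z = pca_reduce(X, n_components=2)
print(Z.shape)  # (100, 2)
```

In practice, deep-learning variants such as autoencoders play a similar role for nonlinear structure, but the linear case above shows the basic idea of replacing many dimensions with a few informative ones.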
Using the 7 Vs characteristics of big data, assign an issue you recognize in your industry to each characteristic and think of possible solutions. Additionally, write down the positives, and consider whether some of these key points could be applied elsewhere in your industry in a different manner.

1. Introduction
Many data sets are heterogeneous in type, structure, organization, granularity, semantics, and accessibility. The high diversity of data sources often leads to data silos: collections of non-integrated data management systems with heterogeneous schemas, APIs, and query languages. Data types from heterogeneous sources often need to be unified during pre-processing. To scale to many different sources, holistic data integration methods should be automatic or require only minimal manual interaction. Data integration is a process that combines multiple local sources without putting their data into a central warehouse; it can ensure the interoperation of the sources and access to up-to-date data. It is important for heterogeneous data sources to be harmonized into a single data framework before they are consolidated and integrated. Efforts are therefore required to develop systems that can map different standards to a common format or create semantic interoperability between the standards.
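A minimal sketch of mapping two hypothetical source schemas to one common format, the kind of harmonization step described above (all field names and records here are invented for illustration):

```python
# Source A and Source B describe people with different schemas;
# both are mapped to a common {"name", "dob"} format before integration.

def from_source_a(rec):
    # Source A uses keys: full_name, birth_date
    return {"name": rec["full_name"], "dob": rec["birth_date"]}

def from_source_b(rec):
    # Source B splits the name and uses a different date key
    return {"name": f'{rec["first"]} {rec["last"]}', "dob": rec["born"]}

unified = [
    from_source_a({"full_name": "Ada Lovelace", "birth_date": "1815-12-10"}),
    from_source_b({"first": "Alan", "last": "Turing", "born": "1912-06-23"}),
]
print(unified[1]["name"])  # Alan Turing
```

Real integration systems generate such mappings (semi-)automatically from schema matching, but the target of the mapping is the same: one common representation over heterogeneous sources.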
Disparate data is heterogeneous data collected from any number of sources. The sources may be known or unknown and may use various formats. Disparate data includes a lot of noise and many inaccurate records, making it necessary to filter the noise and remove those records. Big data is often identified as disparate data when the sources are heterogeneous. Big data has been defined as "datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze". It often ranges from a few dozen terabytes (TB: approximately 10^12 bytes) to multiple petabytes (PB: approximately 10^15 bytes). Big data is often represented by large amounts of high-dimensional and poorly structured or organized records when the data is generated from heterogeneous sources. It can be structured (e.g. spreadsheets, relational databases), unstructured (e.g. text, images), or semi-structured, such as radio frequency identification (RFID) data and extensible markup language (XML) data. Big data is often selective, incomplete, and erroneous. The characteristics of big data can be categorized into "7 Vs" as follows:
- Volume: massive amounts of data.
- Variety: heterogeneity of data types, representation, and semantic interpretation.
- Velocity: data is generated at a rate exceeding what traditional systems can handle.
- Variability: data changes (dynamic) during processing and the lifecycle.
- Veracity: accuracy, truthfulness, and reliability.
- Valence: connectedness; two data items are connected when they are related to each other.
- Value: added value brought from the collected data.
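The filtering of noise and inaccurate records mentioned for disparate data can be sketched as follows; the records and the validity rule are invented examples:

```python
# Toy records from a hypothetical sensor feed; two are unusable.
records = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": -999.0},  # sensor error code: inaccurate record
    {"id": 3, "temp_c": None},    # missing value: noise
    {"id": 4, "temp_c": 19.8},
]

def is_valid(rec):
    t = rec.get("temp_c")
    # keep only present, physically plausible readings
    return t is not None and -50.0 <= t <= 60.0

clean = [r for r in records if is_valid(r)]
print([r["id"] for r in clean])  # [1, 4]
```

At big-data scale the same predicate would run inside a distributed pipeline rather than a list comprehension, but the cleaning logic is of this form.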
It is difficult for traditional data models to handle complex data within the framework of big data, and no data model has yet been acknowledged as effective and efficient for big data. Table 1 describes big data in various aspects. The Internet of Things (IoT) consists of items that are identifiable as part of the Internet; they enable better processes and offer better services when they connect with each other over the Internet. The process of changing data into an appropriate format for analysis is called transformation. In a column-oriented store, data is stored in columns, and attribute values belonging to the same column are stored contiguously. A document-oriented store supports complex data forms, such as JSON, XML, and binary forms. A key-value store helps store and access data of very large size. A graph database uses graph models with nodes, edges, and properties related to each other through relations. Big data analytics is powerful in discovering unapparent correlations in the data.
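The store models above can be illustrated with plain Python structures; this is a toy sketch, not a real database:

```python
import json

# Key-value: opaque values addressed by a unique key
kv = {"user:42": "Ada"}

# Document-oriented: complex nested forms such as JSON
doc = json.loads('{"user": {"id": 42, "name": "Ada"}}')

# Graph: nodes, edges, and properties linked through relations
nodes = {"A": {"label": "person"}, "B": {"label": "person"}}
edges = [("A", "B", {"relation": "knows"})]

print(kv["user:42"], doc["user"]["name"], edges[0][2]["relation"])
```

A column-oriented store would instead keep each attribute's values contiguously (e.g. one array per column), which is what makes scans over a single attribute fast.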
Table 1. Big Data in Different Aspects (columns: Data Formats, Data Sources, Data Processing, Data Staging, Data Stores)