Big Data Analytics for Disparate Data
4. Data Mining and Machine Learning for Disparate Data
4.1. Dimension Reduction
Principal component analysis (PCA) is an unsupervised method for handling high-dimensional data: a dataset is transformed from its original coordinate system to a new one. The first axis of the new coordinate system is chosen in the direction of greatest variance in the dataset, and the second axis is orthogonal to the first and points in the direction of the second-largest variance. Repeating this procedure typically shows that the majority of the variance is captured by the first few axes. PCA is especially effective when the columns of the data are strongly correlated; in that situation, the correlated columns can be replaced with a single column, which reduces data dimensionality and complexity and identifies the most important features of the data.
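To make the procedure concrete, the following is a minimal PCA sketch using scikit-learn on simulated data; the simulated dataset, the standardization step, and the 95% variance threshold are illustrative assumptions rather than details from the text.

```python
# Minimal PCA sketch (illustrative only): strongly correlated columns are
# reduced to a few principal components that capture most of the variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Simulate strongly correlated columns: a few latent signals observed
# through many noisy, redundant measurements (assumed data, for illustration).
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 20))

# Standardize so PCA reflects the correlation structure, not raw scales.
X_std = StandardScaler().fit_transform(X)

pca = PCA()
scores = pca.fit_transform(X_std)

# Most of the variance is contained in the first few axes.
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_keep = int(np.searchsorted(cum_var, 0.95)) + 1   # components for ~95% variance
print("components kept:", n_keep)

X_reduced = scores[:, :n_keep]   # lower-dimensional representation of the data
```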
Factor analysis is another method for dimension reduction. It assumes that some unobservable latent variables generate the observed data and that the observed data are a linear combination of the latent variables plus noise. The number of latent variables is usually smaller than the number of observed variables, which is what achieves the dimension reduction. Exploratory factor analysis (EFA) is a family of methods designed to uncover the latent structure in a given set of variables. Both PCA and EFA are based on correlation matrices, so it is important to remove or impute missing data before proceeding with the analysis.
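A similar sketch for exploratory factor analysis, again with scikit-learn; the number of latent factors, the simulated data, and the mean-imputation step are assumptions made for illustration.

```python
# Minimal factor-analysis sketch: observed data assumed to be a linear mix
# of a few latent variables plus noise, with missing values imputed first.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(1)

# Simulated observations generated from two latent variables (assumption).
latent = rng.normal(size=(300, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + 0.2 * rng.normal(size=(300, 10))

# Introduce a few missing values, then impute them, since both PCA and EFA
# require complete data (a complete correlation matrix).
X[rng.random(X.shape) < 0.05] = np.nan
X_complete = SimpleImputer(strategy="mean").fit_transform(X)

fa = FactorAnalysis(n_components=2)       # fewer latent factors than observed variables
factors = fa.fit_transform(X_complete)    # estimated factor scores, shape (300, 2)
print(fa.components_.shape)               # estimated loadings, shape (2, 10)
```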
Challenges in Big Data analytics arise due to the high dimensionality and the amount of data. High dimensionality can also result in spurious correlations, because unrelated features may be correlated simply by chance, which leads to erroneous inferences and false discoveries.
4.2. Performance, Evaluation Criteria and Challenges of Data Mining
Performance issues in data mining include scalability, efficiency, and the parallelization of algorithms. Data mining algorithms need to be scalable and efficient in order to extract information effectively from large amounts of data; in other words, their run time should be acceptable and predictable. Considerable research is still needed to overcome the challenges related to accuracy, scalability, heterogeneity, speed, provenance, trust, privacy, and interactiveness. Provenance is directly related to the trust and accuracy of both the source data and the results of data mining; however, provenance information is not always recorded or available, and many provenance-related problems in data mining remain unsolved because such information is lacking for many sources.
The speed of data mining is strongly related to two main factors: the efficiency of the data mining algorithms and the data access time. It is necessary to improve the speed of data mining and of big data access by identifying and exploiting potential parallelism in both the data access and the data mining algorithms. In data parallelism, the original dataset is divided into many small subsets and the same program runs on each partition; the partial results are then merged into a final result. The computational complexity of some data mining algorithms, the large size of many databases, and widely distributed data are motivating factors for developing parallel and distributed data mining algorithms. Such algorithms divide the data into partitions, and the division process itself can be parallelized along with the computation of the data mining step. Without such parallelism, it is difficult for a single-processor system to provide responses and results efficiently for large-scale data mining. Parallel clustering, such as parallel clustering based on parallel k-means, is one family of parallel data mining methods. However, parallel data mining introduces new complexity, as it combines data mining techniques and algorithms with parallel programming and databases.
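The data-parallel pattern described above can be sketched for k-means as follows: each partition independently computes per-cluster partial sums and counts, and the partial results are merged to update the centroids. The partition count, the process pool, and the fixed number of iterations are illustrative assumptions.

```python
# Data-parallel k-means sketch: one Lloyd iteration where every partition runs
# the same program and the partial results are merged into new centroids.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def partial_stats(args):
    """Assign the points of one partition to the nearest centroid and
    return per-cluster partial sums and counts."""
    part, centroids = args
    dists = np.linalg.norm(part[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = part[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts

def parallel_kmeans_step(partitions, centroids, workers=4):
    # The same per-partition program runs on every subset (data parallelism);
    # the partial sums/counts are then merged into updated centroids.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(partial_stats, [(p, centroids) for p in partitions]))
    total_sums = sum(s for s, _ in results)
    total_counts = sum(c for _, c in results)
    # Empty clusters are left at the origin for simplicity in this sketch.
    return total_sums / np.maximum(total_counts[:, None], 1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(10_000, 5))
    partitions = np.array_split(X, 8)                  # divide the data into subsets
    centroids = X[rng.choice(len(X), 3, replace=False)]
    for _ in range(10):                                # iterate until convergence in practice
        centroids = parallel_kmeans_step(partitions, centroids)
    print(centroids)
```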
There are numerous challenges in data mining research, including: 1) scalability for high-speed stream data and high-dimensional data, 2) knowledge mining from complex data, 3) mining multi-agent data and distributed data mining, 4) handling unbalanced, non-static, and cost-sensitive data, 5) network data mining, 6) mining time-series and sequence data, and 7) development of a unified theory of data mining. Big Data analytics and mining have the potential to extract valuable information from big stream data, which is characterized by its volume, velocity, and variability. Heterogeneous mixture learning is another advanced technology to be developed for the analysis of heterogeneous data.
4.3. Traditional Machine Learning Methods and Deep Learning
Traditional machine learning (ML) methods have the following limitations:
- For distributed data sources, data repositories are physically distributed, often dynamic, and large in volume. It is often not practicable to move all the data to a central location for analysis. Knowledge acquisition systems are therefore required to perform the necessary analyses at the data locations and to transmit the results to wherever they are needed for further processing or analysis. Such systems are also required to learn from statistical summaries of the data (see the sketch after this list).
- When heterogeneous sources are used in a given context, semantic differences often need to be reconciled from the user's point of view. For learning purposes, methods are needed to efficiently and dynamically extract and integrate information from distributed and semantically heterogeneous sources according to user-specified ontologies and mappings between ontologies.
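As a rough illustration of learning from statistical summaries of distributed data, the sketch below has each site report only its count, sum, and sum of squares, from which a central node reconstructs the global mean and variance without moving the raw records; the simulated sites and the choice of statistics are assumptions made for illustration.

```python
# Sketch: learning from statistical summaries of physically distributed data.
# Only small summaries travel to the central node, not the raw records.
import numpy as np

def local_summary(x):
    """Summary statistics computed at the data's own location."""
    return len(x), x.sum(axis=0), (x ** 2).sum(axis=0)

def merge_summaries(summaries):
    """Combine per-site summaries into a global mean and (population) variance."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    var = total_sq / n - mean ** 2
    return mean, var

rng = np.random.default_rng(3)
# Three simulated distributed sources; in practice these stay at their sites.
sites = [rng.normal(loc=i, size=(1000, 4)) for i in range(3)]
mean, var = merge_summaries([local_summary(x) for x in sites])
print(mean, var)
```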
Three primary reasons make traditional ML methods unsuitable for big data classification problems. First, an ML method is generally trained on a fixed set of class types, whereas a dynamically growing dataset often contains many varieties of class types (including new ones), which can result in inaccurate classification. Second, an ML method trained on a particular data domain or labeled dataset may not be suitable for another domain or dataset, so classification based on that method is unlikely to be robust across different data domains or datasets. Third, an ML method is based on a single learning task and cannot fulfill the multiple learning tasks and knowledge transfer requirements of Big Data analytics. Among traditional ML methods, the support vector machine (SVM) performs well; however, it works well only for datasets of moderate size and has limitations for big data applications.
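The contrast below is an illustrative sketch, not taken from the source: a kernel SVM is trained on a moderate-sized synthetic dataset, while a linear classifier trained with stochastic gradient descent is shown as one commonly used, more scalable alternative when the data grow large. The dataset size and the choice of models are assumptions.

```python
# Illustrative contrast: kernel SVM on moderate data vs. a linear SGD model
# that scales to much larger datasets and supports incremental training.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Kernel SVM: effective here, but its training cost grows steeply with the
# number of samples, which limits it for big data applications.
svm = SVC(kernel="rbf").fit(X_tr, y_tr)
print("SVC accuracy:", svm.score(X_te, y_te))

# Linear classifier trained with stochastic gradient descent: a common,
# more scalable alternative (also trainable incrementally via partial_fit).
sgd = SGDClassifier(loss="hinge", random_state=0).fit(X_tr, y_tr)
print("SGD accuracy:", sgd.score(X_te, y_te))
```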
In addition to the challenge of handling large volumes of data, Big Data analytics poses some unique challenges for ML. These challenges include highly distributed input sources, variety in the formats of the raw data, data trustworthiness, high-dimensional data, uncategorized and unsupervised data, algorithm scalability, limited supervised/labeled data, fast-moving stream data, noisy and poor-quality data, and imbalanced input data.
Deep learning architectures and algorithms are well suited to dealing with issues related to the variety and volume of big data. The challenges of deep learning in Big Data analytics for disparate data include handling high-dimensional data, distributed computing, learning with stream data, and model scalability. Distributed learning has attracted considerable attention, and among other methods of learning from large datasets, high-performance parallel computing is very useful for dealing with distributed data.
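As one hedged illustration of learning with stream data, the sketch below updates a small neural network incrementally on mini-batches rather than loading the whole dataset at once; the architecture, batch size, and simulated stream are assumptions and are not presented as the approach discussed in the text.

```python
# Sketch: incremental (mini-batch) training of a small neural network, as one
# way to cope with stream data and datasets too large to hold in memory.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Simulated data standing in for an incoming stream (assumption).
X, y = make_classification(n_samples=20_000, n_features=30, random_state=0)
classes = np.unique(y)

model = MLPClassifier(hidden_layer_sizes=(64, 32), random_state=0)

batch_size = 512
for start in range(0, len(X), batch_size):        # consume the stream batch by batch
    xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]
    model.partial_fit(xb, yb, classes=classes)    # incremental update per mini-batch

print("training accuracy:", model.score(X, y))
```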