The amount of data collected is staggering. The article was written in the middle of 2019; how much data is collected daily now? The National Security Agency monitors hot spots for terrorist activity using drone feeds, and it admitted several years ago that analyzing what it had already collected would take decades, yet the collection continues. The key to effective analysis is identifying the most relevant datasets and applying the correct analytic techniques, which returns us to our mix of art and science. As the article indicates, very little research has addressed removing uncertainty from the value of datasets that grow daily. With BI, at least, you are typically looking mainly at data created within your own firm, which places some limits on its amount and type; but in a firm as large as, say, Amazon, imagine the amount of data created every day, not only at the point of purchase but across its hundreds (maybe thousands) of automated fulfillment centers around the world. Looking at Figure 1, the 5 Vs of big data characteristics, think about the challenges posed by the kinds and amount of data your firm collects daily. Is it housed in a common system, or in different systems depending on the department that collects and uses it? How would you characterize its various Vs? Is it manageable? What level and types of uncertainty would you assign to the various datasets you regularly work with?
Summary of Mitigation Strategies
This paper has reviewed numerous AI techniques for big data analytics and the impact of uncertainty on each of them. Table 2 summarizes these findings. The first column categorizes each AI technique as ML, NLP, or CI. The second column describes how uncertainty affects each technique, both uncertainty in the data and uncertainty in the technique itself. The third column summarizes proposed mitigation strategies for each uncertainty challenge. For example, the first row of Table 2 shows one way uncertainty can be introduced into ML: incomplete training data. One approach to overcoming this specific form of uncertainty is active learning, which selects and labels the subset of the data judged most informative, thereby countering the problem of limited available training data.
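As an illustration, the sketch below shows one common form of active learning, pool-based uncertainty sampling, in which the learner repeatedly queries the unlabeled point it is least confident about. The synthetic dataset, logistic-regression model, and query budget are illustrative assumptions rather than details from the paper; scikit-learn is assumed to be available.

```python
# A minimal sketch of pool-based active learning with uncertainty sampling.
# The synthetic dataset, model choice, and query budget are illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic pool standing in for a large, mostly unlabeled dataset.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=20, replace=False))  # small seed set
unlabeled = [i for i in range(len(X)) if i not in set(labeled)]

model = LogisticRegression(max_iter=1000)

for _ in range(10):  # ten query rounds, one new label per round
    model.fit(X[labeled], y[labeled])
    # Query the pool point the model is least certain about:
    # the one whose top predicted class probability is lowest.
    proba = model.predict_proba(X[unlabeled])
    least_confident = int(np.argmax(1.0 - proba.max(axis=1)))
    query = unlabeled[least_confident]
    labeled.append(query)      # in practice an oracle would supply y[query]
    unlabeled.remove(query)

model.fit(X[labeled], y[labeled])
print(f"Labeled {len(labeled)} of {len(X)} points; "
      f"accuracy on remaining pool: {model.score(X[unlabeled], y[unlabeled]):.3f}")
```

The point of the loop is that each new label is spent where the model is most uncertain, rather than on a random sample, which is why active learning can cope with a limited labeling budget.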
Table 2 Uncertainty mitigation strategies
| Artificial intelligence technique | Uncertainty | Mitigation strategy |
|---|---|---|
| Machine learning | Incomplete training samples; inconsistent classification; learning from low-veracity, noisy data | Active learning, deep learning, fuzzy sets, feature selection |
| Machine learning | Learning from unlabeled data | Active learning |
| Machine learning | Scalability | Distributed learning, deep learning |
| Natural language processing | Keyword search | Fuzzy logic, Bayesian methods |
| Natural language processing | Ambiguity of words in part-of-speech (POS) tagging | ICA, LIBLINEAR, and MNB algorithms |
| Natural language processing | Classification (simplifying language assumptions) | ICA; open issue |
| Computational intelligence | Low-veracity, complex, noisy data | Fuzzy logic, EA |
| Computational intelligence | High volume and variety | Swarm intelligence, EA, fuzzy-logic-based matching algorithm |
Note that we have explained each big data characteristic separately. Combining two or more of these characteristics, however, incurs exponentially more uncertainty and thus requires even further study.
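To make the fuzzy-logic entries in Table 2 a little more concrete, the sketch below shows a single fuzzy membership function applied to noisy readings. The "high temperature" set and its 30 to 40 degree breakpoints are hypothetical values chosen only for illustration, not taken from the paper.

```python
# A minimal sketch of a fuzzy membership function, one of the fuzzy-logic
# mitigations listed in Table 2. The "high temperature" set and its 30-40 C
# breakpoints are hypothetical and chosen only for illustration.
def high_temperature(reading_celsius: float) -> float:
    """Degree of membership in the fuzzy set 'high temperature', from 0.0 to 1.0."""
    if reading_celsius <= 30.0:
        return 0.0
    if reading_celsius >= 40.0:
        return 1.0
    # Linear ramp between 30 and 40 C: partial membership instead of a hard
    # threshold, so small amounts of sensor noise shift the score gradually
    # rather than flipping a binary label.
    return (reading_celsius - 30.0) / 10.0

for t in (29.5, 33.0, 38.2, 41.0):
    print(f"{t:5.1f} C -> membership {high_temperature(t):.2f}")
```

A downstream rule engine can combine such graded memberships instead of brittle yes/no flags, which is one way fuzzy approaches absorb low-veracity, noisy input.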