The amount of data collected is staggering. The article was written in mid-2019; how much data is collected daily now? The National Security Agency monitors hot spots for terrorist activity using drone feeds, and it admitted several years ago that analyzing what it had already collected would take decades, even as collection continues. The key to effective analysis is identifying the most relevant datasets and applying the correct analytic techniques, which returns us to the mix of art and science. As the article indicates, little research has addressed removing uncertainty from the value of datasets that grow daily. With BI, you are typically looking mainly at data created within your own firm, which places some limits on its amount and type; but in a firm as large as, say, Amazon, imagine the data created every day, not only at the point of purchase but in its hundreds (maybe thousands) of automated fulfillment centers around the world.

Looking at figure 1, the 5Vs of big data characteristics, think about the challenges posed by the kinds and amount of data your firm collects daily. Is it housed in a common system or in different systems depending on the department that collects and uses it? How would you characterize its various Vs? Is it manageable? What level and types of uncertainty would you assign to the datasets you regularly work with?
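One practical way to start answering that last question is to compute simple data-quality indicators per field. The sketch below is a minimal, hypothetical example (the records, field names, and metrics are assumptions, not from the article): it treats missing values and inconsistent types as crude proxies for the veracity-related uncertainty you might assign to each column.

```python
# Sketch: crude "veracity" indicators for a toy dataset.
# All records and field names below are hypothetical.

records = [
    {"order_id": "A1", "qty": 2,     "price": 19.99},
    {"order_id": "A2", "qty": "two", "price": 5.00},   # inconsistent type
    {"order_id": "A3", "qty": 1,     "price": None},   # missing value
    {"order_id": "A4", "qty": 3,     "price": 7.50},
]

def veracity_report(rows, numeric_fields):
    """Return missing-value and type-inconsistency rates per numeric field."""
    report = {}
    n = len(rows)
    for field in numeric_fields:
        missing = sum(1 for r in rows if r.get(field) is None)
        bad_type = sum(
            1 for r in rows
            if r.get(field) is not None
            and not isinstance(r[field], (int, float))
        )
        report[field] = {
            "missing_rate": missing / n,
            "type_inconsistency_rate": bad_type / n,
        }
    return report

print(veracity_report(records, ["qty", "price"]))
```

Even rough rates like these make the "what uncertainty would you assign?" question concrete: a column with a high missing or inconsistency rate warrants more hedged conclusions than a clean one.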
Discussion
This paper has discussed how uncertainty can impact big data, both in the analytics applied to it and in the dataset itself. Our aim was to survey the state of the art in big data analytics techniques, examine how uncertainty can negatively impact those techniques, and identify the open issues that remain. For each common technique, we have summarized relevant research to aid others in this community in developing their own techniques. We have discussed the issues surrounding the five V's of big data; however, many other V's exist. Existing research has focused largely on the volume, variety, velocity, and veracity of data, with less work on value (e.g., data related to corporate interests and decision making in specific domains).
Future research directions
This paper has uncovered many avenues for future work in this field. First, the interactions between the big data characteristics require additional study, as they do not exist separately but naturally interact in the real world. Second, the scalability and efficacy of existing analytics techniques applied to big data must be empirically examined. Third, new ML and NLP techniques and algorithms must be developed to meet the real-time demands of decisions based on enormous amounts of data. Fourth, more work is needed on how to efficiently model uncertainty in ML and NLP, and on how to represent the uncertainty that results from big data analytics. Fifth, because computational intelligence (CI) algorithms can find approximate solutions within a reasonable time, they have been used in recent years to tackle ML problems and uncertainty challenges in data analytics. However, there remains a lack of CI metaheuristic algorithms applied to big data analytics to mitigate uncertainty.
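To make the fifth point concrete, the sketch below shows one classic CI metaheuristic, simulated annealing, finding an approximate minimum of a small non-convex function. The objective, step size, temperature schedule, and seed are all illustrative assumptions, not anything proposed by the paper; the point is only that such methods trade exactness for bounded runtime, which is what makes them candidates for uncertainty-laden big data settings.

```python
import math
import random

# Sketch: simulated annealing, a CI metaheuristic that accepts occasional
# worse moves (with probability exp(-delta/temp)) to escape local minima.
# Objective and all parameters are illustrative assumptions.

def objective(x):
    # Non-convex 1-D function with several local minima.
    return x * x + 10 * math.sin(x)

def simulated_annealing(start, steps=5000, temp=10.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    current = best = start
    current_val = best_val = objective(start)
    for _ in range(steps):
        candidate = current + rng.uniform(-0.5, 0.5)
        delta = objective(candidate) - current_val
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = candidate, objective(candidate)
            if current_val < best_val:
                best, best_val = current, current_val
        temp *= cooling
    return best, best_val

best_x, best_val = simulated_annealing(start=8.0)
```

The returned solution is approximate, not guaranteed optimal, yet it arrives within a fixed budget of iterations; the open question the paper raises is how to adapt such metaheuristics to the scale and uncertainty of big data analytics.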