Introduction

According to the National Security Agency, the Internet processes 1826 petabytes (PB) of data per day. In 2018, the amount of data produced every day was 2.5 quintillion bytes. Previously, the International Data Corporation (IDC) estimated that the amount of generated data would double every two years; yet 90% of all data in the world was generated over the preceding two years. Moreover, Google now processes more than 40,000 searches every second, or roughly 3.5 billion searches per day. Facebook users upload 300 million photos, 510,000 comments, and 293,000 status updates per day. Needless to say, the amount of data generated on a daily basis is staggering. As a result, techniques are required to analyze and understand this massive amount of data, as it is a great source from which to derive useful information.
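As a quick consistency check (our arithmetic, not a figure from the sources cited above), the per-second and per-day search rates agree:

\[
40{,}000~\tfrac{\text{searches}}{\text{s}} \times 86{,}400~\tfrac{\text{s}}{\text{day}} = 3.456 \times 10^{9}~\tfrac{\text{searches}}{\text{day}} \approx 3.5~\text{billion searches per day}.
\]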

Advanced data analysis techniques can be used to transform big data into smart data for the purposes of obtaining critical information regarding large datasets. As such, smart data provides actionable information and improves decision-making capabilities for organizations and companies. For example, in the field of health care, analytics performed on big datasets (provided by applications such as Electronic Health Records and Clinical Decision Systems) may enable health care practitioners to deliver effective and affordable solutions for patients by examining trends in the overall history of the patient, rather than relying on evidence drawn strictly from localized or current data. Big data analysis is difficult to perform using traditional data analytics, as traditional techniques can lose effectiveness in the face of the five V's of big data: high volume, high velocity, high variety, low veracity, and high value. Many other characteristics have also been proposed for big data, such as variability, viscosity, validity, and viability. Several artificial intelligence (AI) techniques, such as machine learning (ML), natural language processing (NLP), computational intelligence (CI), and data mining, have been designed to provide big data analytic solutions, as they can be faster, more accurate, and more precise when handling massive volumes of data. The aim of these advanced analytic techniques is to discover information, hidden patterns, and unknown correlations in massive datasets. For instance, a detailed analysis of historical patient data could lead to the detection of a destructive disease at an early stage, thereby enabling either a cure or a more optimal treatment plan. Additionally, risky business decisions (e.g., entering a new market or launching a new product) can benefit from simulations that support better-informed decision making.
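As a concrete illustration of the kind of "hidden pattern" discovery described above, the following minimal sketch (assuming scikit-learn is available; the dataset is synthetic and the choice of k-means is ours, purely illustrative) shows an unsupervised ML technique recovering latent groups, e.g., patient cohorts, from unlabeled records:

```python
# Minimal sketch: k-means proposes groupings from raw, unlabeled records.
# The data is synthetic; in practice the records might be patient histories.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Unlabeled records containing three latent groups unknown to the analyst.
X, _ = make_blobs(n_samples=5_000, centers=3, random_state=42)

# k-means partitions the records into three clusters from the data alone.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Each discovered cluster can then be inspected for actionable structure,
# e.g., shared risk factors or treatment responses within a cohort.
for k in range(3):
    print(f"cluster {k}: {(labels == k).sum()} records")
```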

While big data analytics using AI holds great promise, a wide range of challenges is introduced when such techniques are subjected to uncertainty. For instance, each of the V characteristics introduces numerous sources of uncertainty, such as unstructured, incomplete, or noisy data. Furthermore, uncertainty can be embedded in the entire analytics process (e.g., collecting, organizing, and analyzing big data). For example, dealing with incomplete and imprecise information is a critical challenge for most data mining and ML techniques. In addition, an ML algorithm may not obtain the optimal result if the training data is biased in any way. Wang et al. introduced six main challenges in big data analytics, including uncertainty. They focus mainly on how uncertainty impacts the performance of learning from big data, whereas a separate concern lies in mitigating the uncertainty inherent within a massive dataset. These challenges are normally present in data mining and ML techniques, and scaling them up to the big data level compounds any errors or shortcomings of the entire analytics process. Therefore, mitigating uncertainty in big data analytics must be at the forefront of any automated technique, as uncertainty can significantly influence the accuracy of its results.
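The effect of biased training data mentioned above can be demonstrated directly. The following sketch (assuming scikit-learn; the dataset, model, and bias rate are our illustrative choices, not drawn from the surveyed works) trains the same classifier on a balanced sample and on a heavily skewed one, then scores both on the same balanced test set:

```python
# Minimal sketch: the same model, trained on balanced vs. biased data,
# evaluated on one balanced test set. Accuracy typically drops under bias.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic two-class dataset standing in for a large real-world corpus.
X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

def fit_and_score(X_tr, y_tr):
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_test, model.predict(X_test))

# Unbiased baseline: train on the full, class-balanced training split.
baseline = fit_and_score(X_train, y_train)

# Biased sample: keep all of class 0 but only ~2% of class 1 examples.
rng = np.random.default_rng(0)
keep = (y_train == 0) | (rng.random(len(y_train)) < 0.02)
biased = fit_and_score(X_train[keep], y_train[keep])

print(f"balanced training data accuracy: {baseline:.3f}")
print(f"biased training data accuracy:   {biased:.3f}")
```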

Based on our examination of existing research, little work has been done on how uncertainty impacts the confluence of big data and the analytics techniques applied to it. To address this shortcoming, this article presents an overview of the existing AI techniques for big data analytics, including ML, NLP, and CI, from the perspective of uncertainty challenges, as well as suitable directions for future research in these domains. The contributions of this work are as follows. First, we consider the uncertainty challenges arising in each of the five V's of big data. Second, we review several big data analytics techniques and the impact of uncertainty on each. Third, we discuss available strategies for handling each challenge presented by uncertainty.

To the best of our knowledge, this is the first article surveying uncertainty in big data analytics. The remainder of the paper is organized as follows. "Background" section presents background information on big data, uncertainty, and big data analytics. "Uncertainty perspective of big data analytics" section considers challenges and opportunities regarding uncertainty in different AI techniques for big data analytics. "Summary of mitigation strategies" section correlates the surveyed works with their respective uncertainties. Lastly, "Discussion" section summarizes this paper and presents future directions of research.