2. Background

The volume of data generated and shared by sources such as businesses, public administrations, numerous industries, non-profit sectors, and scientific research has increased enormously. To give an example of data magnitude, more than 500 million tweets are posted every single day, 428,571 million pieces of content are shared on LinkedIn, and 7 billion Google searches are launched.

In addition, Figure 1 shows a Google Trends comparison of search interest in the term "big data" across 10 years, with time on the x-axis and interest in the search topic on the y-axis. The comparison covers the whole of 2008 against the whole of 2018: the red line shows people's interest in big data in 2008, whereas the blue line shows their interest in 2018. Because the comparison is based on interest over time, the numbers represent search interest relative to the highest point on the graph for the given region and time. A value of 100 is the peak popularity for the term, a value of 50 means that the term is half as popular, and a score of 0 means that there was not enough data for the term. The graph shows that interest in big data increased over the past 10 years. The chart also confirms the graph's result, showing a large increase over the years.
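The relative scale used in Figure 1 can be illustrated with a minimal sketch. It assumes only the rule stated above, namely that every value is expressed relative to the highest point, which maps to 100. The function name `relative_interest` and the sample counts are hypothetical and invented for illustration; Google's actual computation may involve further steps that are not described here.

```python
def relative_interest(raw_counts):
    """Scale counts so that the largest one maps to 100 (peak popularity)."""
    peak = max(raw_counts)
    if peak == 0:
        # No activity at all: every point scores 0.
        return [0 for _ in raw_counts]
    return [round(100 * count / peak) for count in raw_counts]


# Hypothetical search counts for four periods (not real data).
sample_counts = [120, 300, 600, 450]
scaled = relative_interest(sample_counts)
print(scaled)  # [20, 50, 100, 75]
```

In this sketch the third period is the peak (100) and the second period, with half as many searches, scores 50, which matches the reading of the y-axis given above.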



Figure 1. Interest in big data from 2008 to 2018.


Most definitions of big data analytics focus on the size of the data in storage. Size matters, but big data analytics has other important attributes as well, which Erl et al. characterise as the five V's: volume, variety, velocity, veracity, and value, as presented in Figure 2. The paper provides a comprehensive definition and breaks the myth that big data analytics is only about data volume, since each of the five V's has its own ramifications for analytics. According to the survey "Data-intensive applications, challenges, techniques and technologies: A survey on Big Data", volume refers to "the magnitude of data, which has exponentially increased, posing a challenge to the capacity of existing storage devices" and variety refers to "the fact that data can be generated from heterogeneous sources", for example sensors, Internet of Things (IoT) devices, mobile devices, and online social networks, in structured, semi-structured, and unstructured formats. To give a holistic picture of data classification, structured data is typically stored in databases or spreadsheets. Unstructured data refers to text, audio, imagery, and video, which sometimes lack the structural organisation required for analysis. Semi-structured data spans a continuum between fully structured and unstructured data, and its format does not conform to strict standards. Velocity refers to "the speed of data generation and delivery, which can be processed in batch, real-time, nearly real-time, or stream-lines". Veracity "stresses the importance of data quality and level of trust due to the concern that many data sources (e.g., social networking sites) inherently contain a certain degree of uncertainty and unreliability". Finally, value refers to "the process of revealing underexploited values from big data to support decision-making".


Figure 2. Representation of the five V's of big data.
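The distinction between structured, semi-structured, and unstructured formats drawn under variety can be made concrete with a small sketch. The choice of CSV, JSON, and free text as representatives of the three classes is an assumption made here for illustration, and all records and field names are invented.

```python
import csv
import io
import json

# Structured: every record has the same fixed columns, as in a spreadsheet.
structured_source = io.StringIO("user_id,age\n1,34\n2,27\n")
rows = list(csv.DictReader(structured_source))

# Semi-structured: records are self-describing, but fields vary per record.
semi_structured = [
    json.loads('{"user_id": 1, "tags": ["a", "b"]}'),
    json.loads('{"user_id": 2, "location": {"city": "Leeds"}}'),
]

# Unstructured: free text with no predefined fields at all.
unstructured = "Great service, but the delivery took two weeks."

# Compare the field sets of each record in the first two classes.
structured_fields = [set(row) for row in rows]
semi_structured_fields = [set(record) for record in semi_structured]
print(structured_fields[0] == structured_fields[1])            # True
print(semi_structured_fields[0] == semi_structured_fields[1])  # False
```

The structured rows can be queried by column directly, whereas the semi-structured records must be inspected one by one, and the free text would need further processing before any field-based analysis is possible.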


According to studies, further V's and other characteristics have been added to define big data more precisely, such as vision (a purpose), verification (processed data conforms to some specifications), validation (the purpose is fulfilled), variability (data differentiation), venue (different platforms), vocabulary (data terminology), vagueness (indistinctness of existence in data), complexity (it is difficult to organise and analyse big data because of evolving data relationships), and immutability (collected and stored big data can be permanent if well managed). It is worth mentioning that the correlation taxonomy proposed in this paper is not affected by the number of big data analytics characteristics and should be applicable to any number of V's.

The big data analytics concept has been used in many sectors; the latest understanding in academia identifies the potential of big data analytics in five main sectors:

  • Healthcare: clinical decision support systems, individual analytics applied for patient profiling, personalised medicine, performance-based pricing for personnel, analysis of disease patterns and improvement of public health.
  • Public sector: creating transparency with accessible related data, discovering needs, improving performance, customisation of actions for suitable products and services, decision-making with automated systems to decrease risks, innovating new products and services.
  • Retail: in-store behaviour analysis, variety and price optimisation, product placement design, performance improvement, labour input optimisation, distribution and logistics optimisation, Web-based markets.
  • Manufacturing: improved demand forecasting, supply chain planning, sales support, developing production operations, web-search-based applications.
  • Personal location data: smart routing, geo-targeted advertising or emergency response, urban planning, new business models.

The importance of big data analytics has laid the groundwork for investigating the methods and techniques involved in big data, which are explored further in the following section.