Big Data: an overview

Today, large data sources are ubiquitous. Data used for processing may be obtained from measuring devices, radio frequency identifiers, social network message flows, meteorological data, remote sensing, location data streams of mobile subscribers and devices, and audio and video recordings. As Big Data is used more and more widely, a new and important research field is being established. The mass distribution of technology and of innovative models that utilize these different kinds of devices and services was a starting point for the penetration of Big Data into almost all areas of human activity, including the commercial sector and public administration.

Nowadays, Big Data and the continuing dramatic increase in human and machine-generated data associated with it are quite evident. However, do we actually know what Big Data is, and how close are the various definitions put forward for this term? For instance, an article published in Forbes in 2014 addressed this controversial question. It gave a brief history of the establishment of the term, and provided several existing explanations and descriptions of Big Data to improve the core understanding of the phenomenon. In addition, the Berkeley School of Information published a list of more than 40 definitions of the term.

As Big Data covers various fields and sectors, the meaning of the term should be defined in accordance with the activity of the specific organization or person. For instance, in contrast to the industry-driven "V's" definitions of Big Data, Dr. Ivo Dinov listed other multi-dimensional characteristics of data for his research scope, such as data size, incompleteness, incongruency, complex representation, multiscale nature and heterogeneity of its sources.

In this paper the modified Gartner Inc. definition is used: Big Data is a technology to process high-volume, high-velocity, high-variety data or data-sets to extract intended data value and ensure high veracity of original data and obtained information that demand cost-effective, innovative forms of data and information processing (analytics) for enhanced insight, decision making, and process control.


Big Data processing methods

Currently, there exist many different techniques for data analysis, mainly based on tools used in statistics and computer science. The most advanced techniques to analyze large amounts of data include artificial neural networks (models based on the principle of the organization and functioning of biological neural networks), methods of predictive analysis, statistics and Natural Language Processing, among others. Big Data processing methods embrace different disciplines including applied mathematics, statistics, computer science and economics. These are the basis for data analysis techniques such as Data Mining, Neural Networks, Machine Learning, Signal Processing and Visualization Methods. Most of these methods are interconnected and used simultaneously during data processing, which increases system utilization tremendously (see Fig. 1).

Fig. 1

Big Data processing methods interconnection. Applied mathematics, statistics, economics and computer science are the foundation of the Big Data processing methods. Meanwhile, Data Mining, Signal Processing, Neural Networks, Visualization and Machine Learning are strongly connected to each other.

We would like to familiarize the reader with the primary methods and techniques in Big Data processing. As this topic is not the focus of the paper, the list is not exhaustive. Nevertheless, the main interconnections between these methods are shown and application examples are given.

Optimization methods are mathematical tools for efficient data analysis. Optimization includes numerical analysis focused on solving the various Big Data challenges (volume, velocity, variety and veracity), which are discussed in more detail later. Some widely used analytical techniques are genetic programming, evolutionary programming and particle swarm optimization. Optimization focuses on the search for the optimal set of actions needed to improve system performance. Notably, genetic algorithms also form a specific branch of machine learning. Moreover, statistical testing and predictive and simulation models are applied here, as they are in statistical methods.

Statistical methods are used to collect, organize and interpret data, as well as to outline interconnections between realized objectives. Data-driven statistical analysis concentrates on the implementation of statistical algorithms. The A/B testing technique is an example of a statistical method. With Big Data, a wide variety of such tests can be performed. The aim of A/B tests is to detect statistically significant differences and regularities between groups of variables in order to reveal improvements. In addition, statistical techniques include cluster analysis, data mining and predictive modelling methods. Some techniques in spatial analysis originate from the field of statistics as well. Spatial analysis allows the analysis of topological, geometric or geographic characteristics of data sets.
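As an illustration of how an A/B test can detect a statistically significant difference between two groups, the following minimal sketch applies a two-proportion z-test. The choice of this particular test, the function name and the visitor and conversion counts are assumptions made for the example; they are not taken from any study cited here.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value of the standard normal distribution via the error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: variant A converts 200 of 10,000 visitors, variant B 260 of 10,000.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (for example below 0.05, a conventional but arbitrary threshold) suggests that the observed difference between the groups is unlikely to be due to chance alone.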

Data mining includes cluster analysis, classification, regression and association rule learning techniques. This method is aimed at identifying and extracting beneficial information from extensive data or datasets. Cluster analysis groups objects on the basis of their similarity. This technique belongs to unsupervised learning, where no labeled training data is used. Classification is a set of techniques aimed at assigning new data points to categories. In contrast to cluster analysis, a classification technique uses training data sets to discover predictive relationships. Regression is a set of statistical techniques aimed at determining how a dependent variable changes with independent variables. This technique is mostly used for prediction or forecasting. Association rule learning is a set of techniques designed to detect valuable relationships, or association rules, among variables in databases.
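To make the idea of similarity-based grouping concrete, the sketch below implements a very small version of k-means, one common cluster analysis algorithm. It is only an illustration: the choice of k-means, the naive initialization from the first k points and the sample points are assumptions of this example. Note that the algorithm receives no category labels, which is what distinguishes it from classification.

```python
import math


def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns (centroids, labels).

    The initial centroids are simply the first k points (a naive, assumed choice).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


# Hypothetical two-dimensional observations forming two visible groups.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.2), (8.1, 7.9), (7.8, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

A classification technique would instead be given the correct category for each training point and would learn to predict the category of unseen points.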

Machine Learning is a significant area in computer science which aims to create algorithms and protocols. The main goal of this method is to improve computers' behavior on the basis of empirical data. Its implementation allows recognition of complicated patterns and automatic application of intelligent decision-making based on those patterns. Pattern recognition, natural language processing, ensemble learning and sentiment analysis are examples of machine learning techniques. Pattern recognition is a set of techniques that use a certain algorithm to associate an output value with a given input value. Classification is an example of this. Natural language processing has its origins in computer science, within the fields of artificial intelligence and linguistics. This set of techniques performs analysis of human language. Sometimes it uses sentiment analysis, which is able to identify and extract specific information from text materials by evaluating words and the degree and strength of a sentiment. Ensemble learning is a useful technique in automated decision-making systems for diminishing variance and increasing accuracy. It aims to solve diverse machine learning issues such as confidence estimation, missing features and error correction.
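The principle behind ensemble learning can be sketched with simple majority voting, in which several imperfect classifiers are combined so that the errors of any single one are outvoted. Majority voting is only one of several ensemble strategies, and the three threshold classifiers and their labels below are hypothetical.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by most of the given classifiers for input x."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Three hypothetical weak classifiers that disagree near the decision boundary.
classifiers = [
    lambda x: "high" if x > 4.0 else "low",
    lambda x: "high" if x > 5.0 else "low",
    lambda x: "high" if x > 6.0 else "low",
]

prediction = majority_vote(classifiers, 5.5)
print(prediction)
```

With an odd number of voters and two labels, as assumed here, ties cannot occur; a real system would need an explicit tie-breaking rule.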

Signal processing consists of various techniques that are part of electrical engineering and applied mathematics. The key aspect of this method is the analysis of discrete and continuous signals. In other words, it enables the analog representation of physical quantities (e.g. radio signals or sounds). In some techniques, signal detection theory is applied to evaluate the capacity for distinguishing between signal and noise. Time series analysis includes techniques from both statistics and signal processing. Primarily, it is designed to analyze sequences of data points that represent data values at consistent time intervals. This technique is useful for predicting future data values based on knowledge of past ones. Signal processing techniques can also be applied to implement some types of data fusion. Data fusion combines multiple sources to obtain improved information that is more relevant or less expensive and has higher quality.
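Predicting a future value from past ones can be illustrated with simple exponential smoothing, a basic time series technique. This is a minimal sketch; the function name, the smoothing factor and the sample readings are assumptions of the example.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.

    alpha in (0, 1] controls how strongly recent observations are weighted.
    """
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]
    for value in series[1:]:
        # Blend the newest observation with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical equally spaced sensor readings.
readings = [20.0, 21.0, 19.5, 22.0, 23.0]
forecast = exponential_smoothing_forecast(readings, alpha=0.5)
print(round(forecast, 3))
```

A larger alpha makes the forecast follow the latest readings more closely, while a smaller alpha yields a smoother, slower-reacting estimate.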

Visualization methods concern the design of graphical representation, i.e. visualizing the innumerable analytical results as diagrams, tables and images. Visualization for Big Data differs from all of the previously mentioned processing methods and also from traditional visualization techniques. To visualize large-scale data, feature extraction and geometric modelling can be implemented. These processes are needed to decrease the data size before actual rendering. Intuitively, visual representation is more likely to be accepted by a human than unstructured textual information. The era of Big Data has been rapidly promoting the data visualization market. According to Mordor Intelligence, the visualization market will increase at a compound annual growth rate (CAGR) of 9.21 % from $4.12 billion in 2014 to $6.40 billion by the end of 2019. SAS Institute provides the results of an International Data Group (IDG) research study in a white paper. The research is focused on how companies are performing Big Data analysis. It shows that 98 % of the most effective companies working with Big Data are presenting the results of their analysis via visualization. Statistical data from this research provides evidence of the benefits of visualization in terms of improved decision-making, better ad-hoc data analysis, and improved collaboration and information sharing inside and outside an organization.
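The quoted growth figures can be checked with the standard CAGR formula, (end / start)^(1 / years) - 1. The short sketch below assumes five compounding periods between 2014 and the end of 2019; that period count is an assumption of this example, not something stated in the Mordor Intelligence forecast.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate over the given number of yearly periods."""
    return (end_value / start_value) ** (1 / years) - 1


# Market size in billions of US dollars, as quoted in the text; 5 periods assumed.
rate = cagr(4.12, 6.40, 5)
print(f"{rate:.2%}")
```

Under that assumption the computed rate rounds to the quoted 9.21 %, so the three figures are mutually consistent.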

Nowadays, different groups of people, including designers, software developers and scientists, are searching for new visualization tools and opportunities. For example, Amazon, Twitter, Apple, Facebook and Google are companies that utilize data visualization in order to make appropriate business decisions. Visualization solutions can provide insights from different business perspectives. First of all, the implementation of advanced visualization tools enables rapid exploration of all customer and user data to improve customer-company relationships. It allows marketers to create more precise customer segments based on purchasing history, life stage and other factors. Besides, correlation mapping may assist in the analysis of customer and user behavior to identify and analyze the most profitable customers. Secondly, visualization capabilities give companies the opportunity to reveal correlations between product, sales and customer profiles. Based on the gathered metrics, organizations may provide novel special offers to their customers. Moreover, visualization enables tracking of revenue trends and can be useful for risk analysis. Thirdly, visualization as a tool provides a better understanding of data. Higher efficiency is reached by obtaining relevant, consistent and accurate information. Visualized data can therefore help organizations to find different effective marketing solutions.

In this section we familiarized the reader with the main techniques of data analysis and described their strong correlation to each other. Nevertheless, the Big Data era is still at the beginning of its evolution. Big Data processing methods are therefore still evolving to solve the problems of Big Data, and new solutions are continuously being developed. By this we mean that the world of Big Data requires multiple multidisciplinary methods and techniques that lead to a better understanding of its complicated structures and the interconnections between them.


Big Data challenges

Big Data has some inherent challenges and problems that, according to Akerkar et al., can be divided into three main groups: (1) data, (2) processing and (3) management challenges (see Fig. 2). While dealing with large amounts of information we face such challenges as volume, variety, velocity and veracity, which, together with value, are also known as the 5V of Big Data. As these Big Data characteristics are well examined in the scientific literature, we will only discuss them briefly. Volume refers to the large amount of data, especially machine-generated data. This characteristic defines a size of the data set that makes its storage and analysis problematic with conventional database technology. Variety is related to the different types and forms of data sources: structured (e.g. financial data) and unstructured (social media conversations, photos, videos, voice recordings and others). The multiplicity of data types makes the data difficult to handle. Velocity refers to the speed of new data generation and distribution. This characteristic requires the implementation of real-time processing for streaming data analysis (e.g. on social media, or in different types of transaction or trading systems). Veracity refers to the complexity of data, which may lead to a lack of quality and accuracy. This characteristic reveals several challenges: uncertainty, imprecision, missing values, misstatement and data availability. There is also a challenge regarding data discovery, which is related to the search for high-quality data in data sets.
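The velocity challenge implies that streaming data often has to be summarized as it arrives, without storing every value. One way to do this, shown here purely as an illustrative sketch, is Welford's online algorithm, which keeps a running mean and variance in constant memory. The class name and the sample readings are hypothetical.

```python
class RunningStats:
    """Welford's online algorithm: mean and sample variance in a single pass."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value):
        """Fold one new observation into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Sample variance; defined as 0.0 until at least two values were seen."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


# Hypothetical stream of readings processed one at a time.
stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)
print(stats.mean, stats.variance)
```

Because each update touches only three numbers, the same object can follow a stream of arbitrary length, which is the property that matters when data arrives faster than it can be stored.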

Fig. 2

Big Data challenges. The picture illustrates three main categories of Big Data challenges that are associated with data, its management and processing issues.

The second branch of Big Data challenges is called processing challenges. It includes data collection, resolving similarities found in different sources, transforming data into a form acceptable for analysis, the analysis itself and output representation, i.e. visualization of the results in a form most suitable for human perception.

The last type of challenge offered by this classification is related to data management. Management challenges usually refer to secure data storage, processing and collection. Here the main focuses of study are data privacy, security, governance and ethical issues. Most of them are controlled through policies and rules provided by information security institutes at state or international level.

Over past generations, the results of analyzed data were represented as visualized plots and graphs. It is evident that collections of complex figures are sometimes hard to perceive, even by well-trained minds. Nowadays, the main factors causing difficulties in data visualization continue to be the limitations of human perception, along with new issues related to display sizes and resolutions. This question is studied in detail in the section "Integration with Augmented and Virtual Reality". Prior to visualization, the main interaction problem is the extraction of the useful portion of information from massive volumes. Extracted data is not always accurate and is often overloaded with superfluous information. Visualization is useful for simplifying information and transforming it into a form more accessible to human perception.

In the near future, petascale data may cause analysis failures because of the traditional approaches in use, i.e. when the data is stored on a memory disk, continuously waiting for further analysis. Hence, the conservative approach of data compression may become ineffective in visualization methods. To solve this issue, developers should create a flexible tool for data collection and analysis. Increases in data size also make the multilevel hierarchy approach unable to scale. The hierarchy becomes complex and intensive, making navigation difficult for the user. In this case, a combination of analytics and Data Visualization may enable more accessible data exploration and interaction, which would improve insights, outcomes and decision-making.

Contemporary methods, techniques and tools for data analysis are still not flexible enough to discover valuable information in the most efficient way. The question of data perception and presentation remains open. Scientists face the task of uniting the abstract world of data and the physical world through visual representation. Meanwhile, visualization-based tools should fulfill three requirements: expressiveness (demonstrating exactly the information contained in the data), effectiveness (related to the cognitive capabilities of the human visual system) and appropriateness (the cost-value ratio used to assess the benefit of a visualization). Experience with previously used techniques can be repurposed to achieve more beneficial and novel goals in Big Data perception and representation.