Big Data Analytics

It is difficult nowadays to open a popular publication, online or in print, and not run into a reference to data science, analytics, big data, or some combination thereof. Big data are data whose scale, distribution, diversity, and velocity require the use of technical architectures, analytics, and tools in order to enable insights that reveal hidden knowledge and create value for business. Three main features characterize big data: volume, variety, and velocity (the three V's). Volume is the size of the data. Velocity is the rate at which data changes or is created. Variety covers the different formats and types of data, as well as the different kinds of uses and ways of analyzing the data.

Data volume is the primary attribute of big data. It can be quantified by size in terabytes (TB) or petabytes (PB), or by the number of records, transactions, tables, or files. Additionally, one of the things that makes big data really big is that it comes from a greater variety of sources than ever before, including IoT data, logs, clickstreams, and social media. Using these sources for analytics means that common structured data is now joined by unstructured data, such as text and human language, and semi-structured data, such as Extensible Markup Language (XML), JSON, or Rich Site Summary (RSS) feeds. Furthermore, multi-dimensional data can be drawn from a data warehouse to add historical context to big data. Thus, with big data, variety is just as big as volume.

Big data can also be described by its velocity, or speed: the frequency of data generation or of data delivery. The leading edge of big data is streaming data, which is collected in real time from websites. Some researchers and organizations have discussed the addition of a fourth V, veracity, which focuses on the quality of the data. It characterizes big data quality as good, bad, or undefined due to data inconsistency, incompleteness, ambiguity, latency, deception, and approximations.
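
As a rough illustration of how these characteristics can be measured, the following sketch computes simple indicators for volume, velocity, variety, and veracity over a tiny in-memory sample. The sample records, the field layout, and the function name profile are assumptions made for this example only; treating an empty payload as an incomplete record is likewise just one possible quality check.

```python
from collections import Counter
from datetime import datetime

# Hypothetical sample of incoming records: (arrival time, format, payload).
RECORDS = [
    ("2024-01-01T00:00:00", "json", '{"user": 1, "page": "/home"}'),
    ("2024-01-01T00:00:01", "xml", "<event><user>2</user></event>"),
    ("2024-01-01T00:00:02", "text", "great product, would buy again"),
    ("2024-01-01T00:00:04", "json", ""),  # empty payload: a quality problem
]


def profile(records):
    """Return rough volume, velocity, variety, and veracity indicators."""
    times = [datetime.fromisoformat(t) for t, _, _ in records]
    span_seconds = (max(times) - min(times)).total_seconds()
    # Velocity: records observed per second of the sampled time span.
    rate = len(records) / span_seconds if span_seconds else float("nan")
    return {
        "volume_records": len(records),
        "volume_bytes": sum(len(p.encode("utf-8")) for _, _, p in records),
        "velocity_records_per_second": rate,
        "variety_formats": dict(Counter(fmt for _, fmt, _ in records)),
        # Veracity: share of records that carry a non-empty payload.
        "veracity_complete_fraction":
            sum(1 for _, _, p in records if p) / len(records),
    }


summary = profile(RECORDS)
print(summary)
```

At real scale these figures would come from the storage and streaming infrastructure rather than from a Python list, but the quantities being measured are the same.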

Interest in big data analytics (BDA) research is on the increase. Google's adoption of MapReduce was a catalyst that has led to many developments in the area of BDA. Further, the development and deployment of Apache Hadoop, Spark, and Mahout have opened the door for organizations to process extremely large datasets, which had not been possible before. BDA is the use of advanced techniques, mostly from data mining and statistics, to find (hidden) patterns in (big) data; in other words, BDA is where advanced techniques operate on big datasets. The term "Big Data" has recently been applied to datasets that grow so large that they become awkward to work with using traditional database management systems (DBMS). A significant number of these techniques rely on commercial tools such as relational DBMS, data warehousing, extract-transform-load (ETL), online analytical processing (OLAP), and business analytics tools. During the 2006 IEEE International Conference on Data Mining (ICDM), the top ten data mining algorithms were identified based on expert nominations, citation counts, and a community survey. In order, those algorithms are: C4.5, k-means, support vector machine (SVM), Apriori, expectation maximization (EM), PageRank, AdaBoost, k-nearest neighbors (kNN), Naïve Bayes, and CART. They cover classification, clustering, regression, association analysis, and network analysis. Organizations and governments are not the only ones that generate data; each and every one of us is now a data generator. We produce data through our mobile phones, social network interactions, GPS, and so on. Most such data, however, is not structured in a way that allows it to be stored or processed in a traditional DBMS. This calls for BDA techniques to make sense of such data.
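
To make the MapReduce model mentioned above more concrete, here is a minimal single-machine sketch of its three phases (map, shuffle, reduce) applied to word counting. It only imitates the programming model in plain Python; it is not the Hadoop or Google API, and the function names and the two sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input; a real job would read from distributed storage.
DOCS = ["big data is big", "data mining finds patterns in data"]
counts = word_count(DOCS)
print(counts)
```

In a real deployment the map and reduce calls run in parallel on many machines and the framework performs the shuffle; the sketch keeps everything in one process so that the data flow is easy to follow.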

Big data analytics is inherently related to data mining, a term that has often been used interchangeably with knowledge discovery in databases (KDD). However, we see data mining as a step towards knowledge discovery. The term KDD was coined in 1989 to refer to the process of finding knowledge in data. KDD is also defined as the process of finding patterns, hidden information, or unknown facts in a database. Traditionally, the notion of finding useful unknown patterns and hidden information in raw data has been given many names, including knowledge discovery in databases, data mining, data archaeology, information discovery, knowledge discovery or extraction, and information harvesting. The lack of consensus on the term is attributable to the relative novelty as well as the multi-disciplinary nature of KDD, which draws on disciplines such as statistics and computer science (machine learning, artificial intelligence (AI), databases, data warehousing, expert systems, knowledge acquisition, and data visualization). KDD is considered the overall process of discovering useful knowledge from data, while data mining refers to the application of an algorithm or technique for extracting patterns and unknown information from the raw data.

Big data analytics is mostly used with the intention to predict. Prediction is the ability to foresee the future by applying certain techniques to datasets. Predictive analytics is a process whereby information extracted from various data sources is used to uncover patterns and predict the future. Predictive analytics has the potential to bring great business value to organizations and individuals alike. In addition, prediction has been identified as a key research area of the future.

Predictive analytics is differentiated from prescriptive analytics, which refers to the determination of a course of action or decision. In other words, the focus of prediction is on what will happen, whereas the focus of prescription is on how to make it happen. For example, in a telecommunications operator context, prediction identifies which customers will churn, while prescription works out ways to prevent that from happening, for instance via simulation models.
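
The churn example can be sketched in a few lines to show where prediction ends and prescription begins. Everything in the block is hypothetical: the hand-set coefficients, the list of retention actions with their costs and assumed effects, and the customer value are invented for illustration. The prescriptive step here is a plain expected-loss comparison, used as a simpler stand-in for the simulation models mentioned above.

```python
import math

# Hypothetical coefficients for a churn score; in practice they would be
# learned from historical data rather than set by hand.
WEIGHTS = {"bias": -2.0, "complaints": 0.8, "months_since_upgrade": 0.05}

# Hypothetical retention actions:
# name -> (cost, assumed relative reduction in churn probability).
ACTIONS = {
    "do_nothing": (0.0, 0.0),
    "discount": (20.0, 0.4),
    "free_upgrade": (60.0, 0.7),
}


def predict_churn(customer):
    """Predictive step: estimate the probability that a customer churns."""
    z = (WEIGHTS["bias"]
         + WEIGHTS["complaints"] * customer["complaints"]
         + WEIGHTS["months_since_upgrade"] * customer["months_since_upgrade"])
    return 1.0 / (1.0 + math.exp(-z))  # logistic function


def prescribe(customer, customer_value):
    """Prescriptive step: choose the action with the lowest expected loss."""
    p = predict_churn(customer)

    def expected_loss(action):
        cost, reduction = ACTIONS[action]
        return cost + p * (1.0 - reduction) * customer_value

    return min(ACTIONS, key=expected_loss)


at_risk = {"complaints": 4, "months_since_upgrade": 20}
loyal = {"complaints": 0, "months_since_upgrade": 2}

for name, customer in (("at_risk", at_risk), ("loyal", loyal)):
    print(name, round(predict_churn(customer), 2), prescribe(customer, 300.0))
```

With these assumed numbers, the predictive step assigns the first customer a much higher churn probability than the second, and the prescriptive step turns that difference into different recommended actions.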