Big data analytics enables scientists to analyze large quantities of data unencumbered by any preconceived theories. Read this article to discover the difference between theory-based and process-based prediction, as well as the necessity of utilizing a combined approach to overcome the inherent challenges.
Epistemological Pitfalls in the BDA Process
Investigating the epistemological challenges and pitfalls is crucial to the IS community, which is becoming more and more multidisciplinary as well as multinational. Numerous authors have discussed the potential of BDA for IS research. For example, Shmueli and Koppius described six roles for predictive analytics: generating new theory, developing measurements, comparing competing theories, improving existing models, assessing relevance, and assessing the predictability of empirical phenomena. Three of these are particularly prone to epistemological pitfalls: generating new theory, improving existing models, and assessing the predictability of empirical phenomena.
Guidelines for employing BDA in IS research have recently been provided. Their authors conclude that "reflecting on the guidelines, we can observe that each phase of the research process requires a revised set of actions and abilities" and advocate a change in the skill set of IS researchers, with stronger emphasis on developing skills for data preparation and the deployment of analytical tools and cross-instrumental evaluation criteria.
However, we seek to go beyond previous work by scrutinizing in more detail the BDA steps of data acquisition, preprocessing, analysis, and interpretation in order to identify the epistemological challenges associated with BDA. Concerned with the theoretical knowledge needed to apply BDA appropriately within the frame of IS research, we seek to address the following practical questions that call for epistemological reflection:
- What kind of data [or datasets] about the world are available to a data scientist or researcher?
- How can these data [sets] be represented?
- What rules govern the conclusions to be drawn from these datasets?
- How should such conclusions be interpreted?
Before the actual steps of analytics are conducted, the primary stage is to define an objective: to identify a problem to solve or an opportunity to grasp. This prerequisite step helps to define what needs to be accomplished. Quite often, the researcher has several competing objectives and constraints that need to be properly adjusted and balanced. Appropriately identifying the objective supports obtaining the right data, which has a cascading impact on the entire BDA process, since data is linked to analytics and the outcome of analytics is linked to interpretation. Therefore, the definition of the objective usually influences the result of the BDA process, especially when generating new theory, improving existing models, and assessing the predictability of empirical phenomena. In addition, the primary objective is normally linked to other related questions that need to be addressed, too. For example, the objective of a specific telecom operator may be to "predict customer churn" [which might generate new theory or improve existing known models]. Related questions include: Who are the most profitable customers? Which of the profitable customers are influencers? How many complaints do we currently get from the customer segment "profitable"? What products and services are used by our top-profitable customers?
A possible consequence of neglecting this primary stage is spending resources on producing the right answers to the wrong questions. Also, without a clear research objective or problem, researchers will not be able to define what data needs to be collected and are tempted to undertake 'streetlight' research (see above). Of course, such a deficiency is expected to harm any kind of research design, but the epistemological pitfalls in BDA are different. In traditional deductive research, the existing body of theoretical knowledge guides the identification of relevant constructs, relations, and variables, and therefore influences the data collection from the outset. The various forms of inductive research (e.g. action research, ethnographic research, grounded theory) also rely on well explained and well reflected approaches to small-size sampling, data collection, and data analysis that are to be applied and balanced from the outset according to the primary research objective (e.g. descriptive, exploratory, explanatory). For BDA, such reflection on research design does not yet exist. Because BDA aims for data-driven (not theory-driven) discoveries, the best practices applied so far in deductive and inductive research do not work in this case. Hence, we need to reexamine every step of the analytics process in order to understand what kind of theoretical knowledge may help in avoiding the epistemological pitfalls that appear.
Acquisition
Big data analytics starts with acquiring the data through copying, streaming, etc. (see also "BDA challenges"). Such acquisition requires a good understanding of the domain (often the business context) as well as of the data. The datasets from which data is sourced should be described in terms of: the data required; background about the data; the list of data sources; for each data source, the method of acquisition or extraction; and the problems encountered during data acquisition or extraction.
One of the challenges associated with big data acquisition is that, on the one hand, there exists too much data while, on the other hand, all acquisition requires time, effort, and resources. As pointed out above, the selection made by the researcher might be attributable to personal preference, technical abilities, the 'streetlight' effect, and/or the impact of data monetization. In practice, researchers seek technological solutions, i.e. tools to acquire and compress the data, and focus on available data. However, such solutions do not really address the epistemological problem: we know that sampling in data collection is crucial and requires a great deal of reflection on how data acquisition decisions affect the result of the research. Similarly, big data acquisition entails epistemological problems that require epistemological solutions, which cannot be achieved without sufficient theorization.
Preprocessing
Preprocessing activities include: checking keys, referential integrity, and domain consistency; identifying missing attributes and blank fields; replacing missing values; harmonizing data, e.g., checking for different values that have similar meanings, such as customer and client; checking the spelling of values; and checking for outliers. As a result, preprocessing provides a description of the dataset including: background (broad goals and plan for preprocessing); rationale for the inclusion/exclusion of datasets; a description of the preprocessing, including the actions that were necessary to address any data quality issues; a detailed description of the resultant dataset, table by table and field by field; rationale for the inclusion/exclusion of attributes; and the discoveries made during preprocessing and their potential implications for analytics.
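A few of these activities can be sketched in plain Python. The records, the synonym map, and the outlier threshold (`mad_factor`) below are hypothetical and chosen only for illustration; a real project would derive them from the domain. The sketch checks keys, harmonizes "client" and "customer", replaces a missing value with the median, flags (rather than deletes) an outlier, and records every action in a log, in the spirit of the preprocessing description called for above.

```python
from statistics import median

# Hypothetical synonym map and records, invented purely for illustration.
SYNONYMS = {"client": "customer"}

records = [
    {"id": 1, "type": "Customer", "spend": 120.0},
    {"id": 2, "type": "client", "spend": None},       # missing value
    {"id": 3, "type": "customer ", "spend": 95.0},    # stray whitespace
    {"id": 4, "type": "Client", "spend": 110.0},
    {"id": 5, "type": "customer", "spend": 5000.0},   # suspicious value
]


def preprocess(rows, mad_factor=5.0):
    """Return cleaned copies of rows plus a log of every action taken."""
    log = []

    # Key check: every id must be unique.
    ids = [r["id"] for r in rows]
    if len(ids) != len(set(ids)):
        log.append("duplicate keys found")

    # Robust centre and spread computed from the observed values only.
    observed = [r["spend"] for r in rows if r["spend"] is not None]
    centre = median(observed)
    mad = median(abs(v - centre) for v in observed)  # median absolute deviation

    cleaned = []
    for r in rows:
        row = dict(r)

        # Harmonization: normalize case and whitespace, then map synonyms.
        label = row["type"].strip().lower()
        row["type"] = SYNONYMS.get(label, label)
        if row["type"] != r["type"]:
            log.append(f"id {row['id']}: type {r['type']!r} -> {row['type']!r}")

        # Missing values: replace with the median of the observed values.
        if row["spend"] is None:
            row["spend"] = centre
            log.append(f"id {row['id']}: missing spend replaced by median {centre}")

        # Outliers: flag only; the decision to drop stays with the researcher.
        row["outlier"] = mad > 0 and abs(row["spend"] - centre) > mad_factor * mad
        if row["outlier"]:
            log.append(f"id {row['id']}: spend {row['spend']} flagged as outlier")

        cleaned.append(row)
    return cleaned, log


cleaned, log = preprocess(records)
for entry in log:
    print(entry)
```

The log is the important part of the design: each replacement or flag is a decision that shapes later analytics, so it is written down rather than applied silently.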
Preprocessing mainly aims at cleansing and harmonizing big data, while quite often overlooking the importance of 'traditional' data collection by the researcher. Overconfidence in big data tends to drive preprocessing towards the assumption that big data are a substitute for, rather than a supplement to, traditional data collection and analysis. The core challenge is that most of the big data in focus are not the output of instruments that were designed to produce valid and reliable data amenable to rigorous knowledge discovery.
For example, in February 2013 Google Flu Trends predicted more than double the proportion of doctor visits for influenza-like illness in the USA than was actually reported by the Centers for Disease Control and Prevention. In this case the initial error lay in a problematic marriage between big data and small data. A large quantity of data does not mean that one can ignore foundational issues of measurement, construct validity, reliability, and dependencies among data. Any empirical research must stand on a foundation of sound measurement, which includes not only the data but also its preprocessing.
Analytics
Despite the significance of predictive analytics, empirical analytics research is still rare in the IS literature. The extant IS literature relies almost exclusively on explanatory statistical modeling, where statistical inference is used to test and evaluate the explanatory power of underlying causal models, and predictive power is assumed to follow automatically from the explanatory model. That said, the central step in BDA is analytics, during which data mining, machine learning, statistics, and other techniques, or models, are chosen and applied to the data. For the implementation of a technique (or model), a number of algorithms are available that can be applied to any dataset. For example, we conducted an experiment at a retail chain in which we analyzed one year of purchase transactions for possible unnoticed relationships between products that ended up in shoppers' baskets. Discovering correlations between certain items uncovered hidden patterns, which helped the marketing team to promote low-selling items together with high-selling items. In this experiment, there was no hypothesis that a certain product, e.g. 1000, had often been bought with another product, e.g. 2000. The data were simply queried to discover what relationships existed that might previously have gone unnoticed.
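A hypothesis-free query of this kind can be illustrated with a minimal co-occurrence count. The five baskets below are hypothetical (only the product identifiers 1000 and 2000 come from the example above), and support, confidence, and lift are the usual association-rule measures, not necessarily the ones used in the experiment described.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets; each set holds the product ids bought together.
baskets = [
    {"1000", "2000", "3000"},
    {"1000", "2000"},
    {"1000", "2000", "4000"},
    {"3000", "4000"},
    {"2000", "4000"},
]


def pair_statistics(baskets):
    """Compute support, confidence and lift for every co-occurring pair."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        support = count / n                       # share of baskets with both
        confidence = count / item_counts[a]       # P(b in basket | a in basket)
        lift = support / ((item_counts[a] / n) * (item_counts[b] / n))
        stats[(a, b)] = {"support": support, "confidence": confidence, "lift": lift}
    return stats


stats = pair_statistics(baskets)
# List pairs from the strongest to the weakest association.
for pair, s in sorted(stats.items(), key=lambda kv: kv[1]["lift"], reverse=True):
    print(pair, s)
```

No pair is singled out in advance; every combination is counted and only then ranked, which is exactly why the interpretation of the resulting patterns needs care.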
At this step it is important to describe the model assumptions, the model itself (e.g. for rule-based models, the list of rules produced together with their accuracy and coverage), and the assessment of results (e.g. why a certain modeling technique and certain parameter settings led to good or bad results).
One efficient approach in BDA research is to identify, early enough, what one is looking for. However, the data scientists responsible for the analytics process are often not aware of this epistemological challenge (i.e. know what you want to know) and/or do not have the knowledge necessary to proceed in that manner. Hence, the preferences of the data scientists and their education might drive the analytics part instead of the problem at hand, leading to insufficient knowledge discovery.
In big data analytics, variables represent the raw input, whereas a feature is a variable selected or (re)constructed from raw variables. Hence, feature selection is part of preprocessing. It helps to reduce measurement and storage requirements as well as training and utilization times, and it addresses the high-dimensionality problem. The idea is to select the features that are most useful for building a good predictor. This is not the same problem as finding or ranking all potentially relevant variables: selecting the individually most relevant variables is usually suboptimal for building a predictor. The relationship between feature selection and model predictive accuracy should therefore be emphasized.
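The difference between ranking variables by individual relevance and selecting a useful subset can be shown on a toy dataset. Everything below is a hypothetical construction: the label is the XOR of `x1` and `x2`, and `x3` is a noisy copy of the label. Accuracy is measured on the same eight rows with a simple lookup-table classifier, so this is an illustration of the argument, not an evaluation protocol.

```python
from collections import Counter, defaultdict
from itertools import combinations

FEATURES = ["x1", "x2", "x3"]

# Hypothetical toy data: label = x1 XOR x2; x3 copies the label,
# except in two of the eight rows where it is flipped.
ROWS = [
    (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
    (0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0),
]
LABELS = [0, 1, 1, 0, 0, 1, 1, 0]


def table_accuracy(feature_idx):
    """Accuracy of predicting the majority label per feature-value combination."""
    groups = defaultdict(Counter)
    for row, y in zip(ROWS, LABELS):
        groups[tuple(row[i] for i in feature_idx)][y] += 1
    return sum(max(c.values()) for c in groups.values()) / len(ROWS)


# Relevance of each variable on its own.
single = {FEATURES[i]: table_accuracy((i,)) for i in range(3)}

# Usefulness of each pair of variables as a predictor.
pairs = {
    (FEATURES[i], FEATURES[j]): table_accuracy((i, j))
    for i, j in combinations(range(3), 2)
}

ranked = sorted(single, key=single.get, reverse=True)
top2_by_rank = tuple(sorted(ranked[:2]))   # the two individually best variables
best_pair = max(pairs, key=pairs.get)      # the pair that predicts best

print("single:", single)
print("top-2 by individual rank:", top2_by_rank, pairs[top2_by_rank])
print("best pair:", best_pair, pairs[best_pair])
```

Here `x1` and `x2` look useless individually, yet together they determine the label, while the subset built from the individual ranking does no better than `x3` alone.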
In particular, the selection of an algorithm's parameters by the data scientist has a profound impact on analytics. A parameter is a value used to fine-tune an algorithm. For any BDA tool, e.g., RapidMiner, there is often a large number of parameters that can be adjusted. Listing the parameters and their chosen values, along with the rationale for each choice, is a key task. For instance, for the K-Means clustering algorithm, the number of clusters k is a parameter. Too large a k might not be useful for the decision maker, and too small a k might not solve the business problem. While empirical research stands on a solid foundation of measurement, data scientists tend to overlook the fact that algorithm parameter settings impact not only analytics but interpretation as well.
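The effect of k can be made concrete with a small, deterministic sketch of the K-Means procedure (Lloyd's iteration) on one-dimensional data. The spend values and the initialization rule are assumptions made for this illustration only; tools such as RapidMiner use their own implementations and defaults.

```python
def assign(points, centroids):
    """Put every point into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for x in points:
        nearest = min(range(len(centroids)), key=lambda j: (x - centroids[j]) ** 2)
        clusters[nearest].append(x)
    return clusters


def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on 1-D data with a deterministic initialization."""
    pts = sorted(values)
    n = len(pts)
    # Assumption: start from k evenly spaced points of the sorted data.
    centroids = [float(pts[int((i + 0.5) * n / k)]) for i in range(k)]
    for _ in range(max_iter):
        clusters = assign(pts, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [
            sum(c) / len(c) if c else centroids[j] for j, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    clusters = assign(pts, centroids)
    # Inertia: within-cluster sum of squared distances to the centroid.
    inertia = sum(
        (x - centroids[j]) ** 2 for j, c in enumerate(clusters) for x in c
    )
    return centroids, clusters, inertia


# Hypothetical customer spend values with three visible groups.
spend = [1, 2, 3, 10, 11, 12, 20, 21, 22]

inertia_by_k = {}
for k in (1, 2, 3, 9):
    centroids, clusters, inertia = kmeans_1d(spend, k)
    inertia_by_k[k] = inertia
    print(k, [round(c, 2) for c in centroids], round(inertia, 2))
```

With k = 9 every customer forms a cluster of their own: the fit is perfect and the result is useless to a decision maker, whereas k = 2 merges two distinct groups. The numbers alone do not say which k answers the business question.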
The algorithms that preprocess and analyze big data to find patterns, trends, and relationships are in many cases treated as 'black boxes' or as 'closed'. Yet understanding analytics algorithms is important because they not only extract and derive meaning from the world but increasingly shape it. In many cases, however, that shaping is semantically blind. For instance, Google matches ads to content without 'knowing' anything about either. The Google Translate service (and the team behind it) does not understand the content of the languages it translates. Netflix reported that 75% of the content choices made by its customers are influenced by its recommendation system.
The following two techniques illustrate the difficulties that managers and users face in understanding analytics:
Support vector machines (SVM)
Discovering the right set of features is a difficult problem in machine learning, and SVMs attempt to model such a feature set. The idea of SVM is to use a [nonlinear] mapping function Φ that transforms data in input space into data in feature space in such a way that the problem becomes linearly separable, whereupon the SVM is able to discover the decision surface (DS) automatically. A DS can be drawn in many ways; see the three potential lines in Fig. 1.
Fig. 1. SVM options for decision surface discovery
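SVM then automatically discovers the optimal separating hyperplane which, when mapped back into input space via $\Phi^{-1}$, could be a complex DS. The discriminating hyperplane in input space corresponds to the function:

$$f(x) = \omega \cdot \Phi(x) + b = \sum_{i \in S} \alpha_i y_i K(x_i, x) + b$$

where $\omega$ is the omega (weight) vector; the right-hand side is the sum, over the support vectors $S$, of the Lagrange parameters $\alpha_i$ (also known as alpha parameters) combined with the class labels $y_i$ and the kernel $K(x_i, x) = \Phi(x_i) \cdot \Phi(x)$; and $b$ is the input bias.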
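To show what evaluating such a function involves, the sketch below computes the standard kernel form of the SVM decision function, f(x) = Σ α_i y_i K(x_i, x) + b, summed over the support vectors. The support vectors, alpha values, bias, and the choice of an RBF kernel are all hypothetical; they were picked by hand for illustration and are not the result of training.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2); gamma is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical, hand-picked support vectors: (x_i, y_i, alpha_i).
support_vectors = [
    ((0.0, 0.0), -1, 1.0),
    ((2.0, 2.0), +1, 1.0),
]
bias = 0.0  # hypothetical bias term b


def decision_function(x):
    """f(x) = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    return sum(
        alpha * y * rbf_kernel(sv, x) for sv, y, alpha in support_vectors
    ) + bias


def classify(x):
    """Points with f(x) >= 0 fall on the positive side of the decision surface."""
    return 1 if decision_function(x) >= 0 else -1


for point in [(1.9, 2.1), (0.1, -0.2), (1.0, 1.0)]:
    print(point, round(decision_function(point), 4), classify(point))
```

Even in this tiny case the prediction is a weighted sum of kernel similarities rather than a rule a manager could read off, which is the interpretability difficulty at issue.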
Ensemble methods: random forests
Ensemble methods use a divide-and-conquer tactic to improve performance. The core principle is that a group of weak learners can come together to produce a strong learner.
The idea of ensemble methods is illustrated in Fig. 2: each classifier on its own is a weak learner, but taken together the classifiers form a strong learner. The data to be modeled are represented by the blue circles. Each learner model is represented as a gray curve, and each gray curve is a fair approximation of the underlying data. The red curve represents the assembled strong learner model, which can be seen as a better approximation of the data. The random forest method is based on tree induction (also known as decision trees) and is frequently used in prediction. To fully understand how it works and to be able to digest its results, one needs to be familiar with bagging, pruning, cross-validation, and impurity measures such as entropy and the Gini index.
Fig. 2. Ensemble methods: random forest
Prediction accuracy can be improved by using ensembles. An ensemble averages across multiple models that rely on different or reweighted data and/or employ different models or methods. Bagging, boosting, and random forests are frequently used ensemble methods. Ensemble methods require voting. The voting operates on class labels, where d_{t,j} is 1 if classifier t chooses class j and 0 otherwise. The ensemble then chooses the class J that receives the largest total vote. To combine the classifiers, boosting takes a weighted majority vote of their predictions. Bagging, on the other hand, uses bootstrap samples to build the classifiers. Each bootstrap sample is constructed by randomly sampling, with replacement, the same number of instances as in the original data. The final classification produced by the ensemble of these classifiers is obtained by simple majority voting.
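The voting rule and the bootstrap step can be sketched as follows. The three classifier predictions and the boosting weights are hypothetical; the sketch does not train any classifier and only shows how d_{t,j} is summed per class, with equal weights for bagging and classifier-specific weights for boosting.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) instances from data at random, with replacement."""
    return [rng.choice(data) for _ in range(len(data))]


def vote(predictions, classes, weights=None):
    """Return the class with the largest total vote, and the totals.

    predictions[t] is the class chosen by classifier t. With weights=None
    every classifier counts equally (simple majority, as in bagging);
    otherwise weights[t] scales the vote of classifier t (as in boosting).
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    totals = {j: 0.0 for j in classes}
    for chosen, w in zip(predictions, weights):
        for j in classes:
            d_tj = 1 if chosen == j else 0   # indicator d_{t,j}
            totals[j] += w * d_tj
    winner = max(classes, key=lambda j: totals[j])
    return winner, totals


classes = ["churn", "stay"]
predictions = ["churn", "stay", "churn"]      # hypothetical outputs of 3 classifiers

majority_winner, majority_totals = vote(predictions, classes)
weighted_winner, weighted_totals = vote(predictions, classes, weights=[0.2, 0.9, 0.3])

print("simple majority:", majority_winner, majority_totals)
print("weighted vote:  ", weighted_winner, weighted_totals)

# One bootstrap sample of a hypothetical dataset of ten instance ids.
rng = random.Random(42)
data = list(range(10))
sample = bootstrap_sample(data, rng)
print("bootstrap sample:", sample)
```

The same three predictions yield "churn" under simple majority and "stay" under the weighted vote, so the combination rule is itself a modeling decision that should be reported.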
Transparency in data collection, preprocessing, and analytics (esp. parameter setting) is indispensable in data science. The above techniques are just brief examples of the power, but also of the complexity, of BDA. If data scientists do not understand and know how they generate predictions, then we are unable to address epistemological issues in the BDA process. The human element of big data (analytics) is strategically important. That is, it is essential to combine the potential of machine learning algorithms with human decision-making skills. There is still a gap between what the machine learning and statistics tools used in big data analytics can do and what can be done with the knowledge they generate, i.e., the role of the human.
Interpretation
Interpretation should relate analytical findings to the existing body of knowledge as well as industry practices and include reflection on certain business objectives, decision making, problem solving, etc.
One of the significant epistemological problems in this step is the interpretation of 'quick & dirty' pattern discovery. The reason is mainly that analytics can be run easily and quickly by the data scientist, even via the cloud. Given these opportunities, the pressure to reach outcomes often supersedes the genuine objective of advancing knowledge.
Another issue is the tension between predictive and explanatory power. BDA often provides more accurate predictions, but this accuracy comes at a cost. That is, the most accurate algorithms, such as SVM, Naïve Bayes classifiers, topic modeling in text analytics, and random forests, are not easily comprehensible to most of those who are supposed to consume their results, i.e., decision makers and managers. In other words, BDA utilizes algorithms that are good at predicting future or unknown events but are unable to provide easy-to-comprehend explanations for their predictions. On the other hand, many decision makers and managers have learned to interpret regression results. Therefore, when users have to choose between accuracy (of BDA algorithms) and interpretability (of 'traditional' statistics), many would favor interpretability.
One example here is the research that resulted in the construction of a corpus of 5.2 million digitized books, containing 4% of the books ever printed. Applying analytics to such big data enables a better understanding of the cultural trends that have prevailed in history. Researchers have used this corpus to study grammar, collective memory, technology adoption, etc. The corpus is a result of Google's efforts to digitize books. That said, one should be very conservative in interpreting results obtained from it, since the corpus contains more than 500 billion words but the languages are not equally represented. The number of words per language is: English 361B, French 45B, Spanish 45B, German 37B, Chinese 13B, Russian 35B, and Hebrew 2B. In addition, the corpus was collected from approximately 40 university libraries worldwide, i.e., not a large representation.
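The imbalance becomes clearer when the word counts quoted above are turned into shares. The short calculation below uses only those figures (in billions of words) and assumes that the seven listed languages make up the whole corpus.

```python
# Word counts per language in billions, as quoted in the text above.
words_billion = {
    "English": 361,
    "French": 45,
    "Spanish": 45,
    "German": 37,
    "Chinese": 13,
    "Russian": 35,
    "Hebrew": 2,
}

# Assumption: these seven languages account for the whole corpus.
total = sum(words_billion.values())
shares = {lang: count / total for lang, count in words_billion.items()}

print(f"total: {total}B words")
for lang, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lang:8s} {share:6.1%}")
```

Under that assumption, English alone accounts for roughly two thirds of the words, so a trend observed in the corpus as a whole is largely a trend in English-language books.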
Big data analytics uses various techniques, such as machine learning, to identify the likelihood of future outcomes by applying those techniques to available datasets. These datasets are generated from a variety of sources with different forms and formats of representation. In addition, the techniques come with assumptions and parameters. All of this raises the risk of (mis)interpretation and can hence render business decisions made on the basis of BDA findings invalid. Therefore, businesses utilizing BDA outcomes should investigate the steps of BDA further in order to safeguard their knowledge discovery activities as well as their fact-based decision making. Addressing this gap, we introduce theory-driven guidance to avoid the epistemological pitfalls and to help mitigate the epistemological challenges encountered during the BDA process.