Big data analytics enables scientists to analyze large quantities of data unencumbered by any preconceived theories. Read this article to discover the difference between theory-based and process-based prediction, as well as the necessity of utilizing a combined approach to overcome the inherent challenges.
Science, Theory, and Predictability
The philosophy of science is rich in different schools of thought. From ancient times until today, its ontological and epistemological underpinnings have changed, and they depend on the tradition and interest of the researcher. For example, science can be understood as discovery (in the school of positivism) or as social construction, and accordingly the function of theories and their role in scientific prediction vary widely. However, even the harshest critics of a universally fixed understanding of science, who advanced the idea of epistemological anarchism, agree that science entails a disciplined way to study the natural and/or socially constructed world. Along these lines, the word 'science' has become increasingly associated with the scientific method itself, i.e. the way scientists interrelate the 'facts', i.e. the empirical data they are able to constitute, and the 'theory', which supposedly captures scientific knowledge for reuse in explanation and/or prediction. However, the production of scientific knowledge has always faced many epistemological challenges, because a universal basis for 'how' to acquire knowledge has never emerged, and therefore every (new) approach remains subject to criticism from various perspectives.
The debate about scientific theories and the call for pluralism are more vivid in the social sciences (compared to the natural and formal sciences), where observation and data collection depend much more on the worldview of the researcher. Building on previous works, theory has been defined as "a statement of relationships between units observed or approximated in the empirical world", where 'observed' means measurable and 'approximated' means constructed whenever the very nature of the unit of study cannot be observed directly (e.g., centralization, satisfaction, or culture). The primary goal of a theory is to answer the questions of knowledge seekers: not only what (descriptive), but also how, when, and why.
The utility (if not quality) of theories is usually considered a function of (a) their overall explanatory power, i.e. the ability to "describe and explain a process or sequence of events", and (b) their predictive power, i.e. the ability to "understand and predict outcomes of interest, even if only probabilistically" (with reference to previous works). In social science, the importance of theory-based prediction is understood as being related to the sample size: "Given a large enough sample, and/or a long enough period of observation, theorists can predict on the basis of some of the worst explanations or no explanations at all. In other words, given a large enough sample and/or a long enough period of observation, one is able to predict for all the wrong reasons" (p. 509f). For example, the prediction that a tossed coin will land heads up half of the time is accurate purely because of statistics (if the coin is tossed often enough), not because of any domain-related theory. In contrast, theory-based prediction implies an understood cause–effect relation that, for example, predicts job satisfaction caused by job enrichment and participatory decision making, but only for a limited scope of organizations and employees.
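The coin example can be made concrete with a small simulation. The following Python sketch is illustrative only: the function name `heads_frequency`, the seed, and the sample sizes are assumptions introduced here, not part of the cited works. It shows a prediction ("about half heads") becoming accurate through sample size alone, without any causal model of the toss.

```python
import random


def heads_frequency(n_tosses: int, seed: int = 0) -> float:
    """Return the observed share of heads in n_tosses simulated fair coin tosses.

    Hypothetical helper for illustration; the fair coin (p = 0.5) is an assumption.
    """
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    heads = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)
    return heads / n_tosses


if __name__ == "__main__":
    # The deviation from 0.5 tends to shrink as the sample grows, although the
    # code contains no model of the physics of a coin toss.
    for n in (10, 1_000, 100_000):
        freq = heads_frequency(n)
        print(f"{n:>7} tosses: share of heads = {freq:.4f}, "
              f"deviation from 0.5 = {abs(freq - 0.5):.4f}")
```

The sketch predicts "for all the wrong reasons" in the sense quoted above: it says nothing about why a coin lands as it does, and it offers no basis for prediction once the assumed conditions (a fair coin, many tosses) no longer hold.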
The distinction between explanatory power and predictive power is also known in information systems (IS) research, an interdisciplinary endeavour between computer science and social/management science that tries to understand and support socio-technical systems. For example, five different types of theories have been distinguished: for analyzing and describing, for understanding, for predicting, for explaining and predicting, and for design and action. In this context we focus on the difference between explanatory models, which aim to statistically test theory-driven hypotheses using empirical data (an approach reported to still dominate the IS literature), and predictive models, which aim to make predictions. Predictive studies include the inductive discovery of relationships among variables in a given dataset, whereby the discovery is driven by techniques and algorithms, without testable a priori hypotheses about causal relationships having to be explicitly formulated.
Nowadays many practical examples illustrate this shift: Google's language translator does not 'understand' language, nor do its algorithms know the contents of webpages. IBM's Watson does not understand the questions it is asked or use deep causal knowledge to generate questions to the answers it is given. There are dozens of lesser-known companies that likewise are able to predict the odds of someone responding to a display ad without a solid theory but rather based on chunks of data about the behavior of individuals and the similarities and differences in that behavior.
With an abundance of data and the computing power to process it now available, it seems as if the striving for probabilistic predictability will take over, and as if scientific utility can be achieved through data processing with less theory or even without it. It also seems as if the fruitful and seemingly inevitable separation of inductive and deductive research is being challenged by data science as a 'competitive' approach, i.e. one that extracts knowledge or insights from data without a priori theories and without theoretical reflection. But is data science, and data analytics in particular, indeed a scientific method free of theoretical input?
Kitchin argues that, epistemologically, big data analytics (BDA) is tempted to fall into the traps of empiricism (with bias in sampling, interpretation, etc.), and instead advocates data-driven science as "a reconfigured version of the traditional scientific method, providing a new way in which to build theory". Data-driven science combines approaches that are abductive (neglecting irrelevant data relations), inductive (generating propositions), and deductive (testing propositions). All of these approaches deal with theories, yet not as starting or end points; instead, theories are tied to the individual steps of the data analysis process.
If we are to expect that theory building and predictability are increasingly an outcome of (big) data processing instead of a reflective cycle of inductive and deductive research, then we do indeed have to reassess the epistemological underpinnings of our research process. We aim to contribute to this discussion by focusing on the epistemological challenges of every step in the BDA process, and we seek to point out theoretical developments that can support BDA in the future.