This article is a bit heavy on jargon for data scientists. Still, it makes the interesting case that what we often call prediction is only making inferences, identifying trends in data, and interpreting them, not using them effectively to predict what is likely to happen next. The article also makes the point that prediction may not be the endpoint of machine learning but that providing prescriptions on what to do about likely future outcomes will become the standard soon. Be sure to read carefully through the box office, marketing, and industry trend examples to see how to apply the concepts in the article.
1. Introduction
The explosive uptake of data science in industry can be attributed to the enormous innovation enabled by pooling developments in quantitative methods across disparate disciplines, and the additional potential emergent at their intersection. Interdisciplinary research in all fields unlocks tantalizing possibilities, but data science is unique in that it brings together two domains, statistics and computation, that are integral to essentially all areas of science, engineering, the digital humanities, and related fields in academia as well as industry. For many practitioners, the excitement of reading new work or participating in conferences in data science is driven by the opportunity to encounter a diversity of ideas and to learn from the hard-won examples of methods that have incubated within varied fields.
But as marvelous as advancements in machine learning and other data science methodologies and technologies may be, they do not create value for business on their own. Value is created when people have the insight to apply these techniques to new problems, to extend their capabilities beyond what was originally contemplated, or to use them as tools that support people in making good decisions and taking appropriate actions.
In "An Executive's Guide to Machine Learning," Pyle and San Jose defined three stages in the application of machine learning, data science, and artificial intelligence in the business world. They call these stages "description," "prediction," and "prescription". This framework has been adopted widely in the business community. They branded the "description" stage as "Machine Learning 1.0," the collection of data in databases to facilitate online processing and question answering. They defined the "prediction" stage, which they denoted as the current state of the art, to mean using models to predict future outcomes. Reflecting the present "urgency" they associated with businesses' adoption of this capability, they used the term "prediction" or related forms 10 times in their nine-page article.
It is understandable that a contemporary observer would form the perspective that prediction has been the principal preoccupation of data science. For example, the popular online platform Kaggle has engaged hundreds of thousands of users, some veterans and some first-time modelers, to participate in data science competitions since 2010. Kaggle has become a highly influential and constructive entry point into the practice of data science, and experience on the platform is frequently cited by job seekers and recruiters as a key way to build credentials for the data science job market. Kaggle always frames its competitions as prediction challenges: the purpose of the actions of data scientists in Kaggle competitions is defined to be the improvement of predictive performance metrics. There is rich discussion on the platform of how users can improve the scores of their models, but relatively little discussion of what can be learned about the systems they are modeling from their models' development and application.
Finally, Pyle and San Jose anticipate a third stage, "prescription," that involves human learning from and interpretation of models to explain why outcomes occur the way they do, which they present as the aspirational future of machine learning. But to scientists, "prescription" is a highly recognizable modality that may be broadly translated to the statistical term "inference". Referring to inference explicitly, Pyle and San Jose urged practitioners to move beyond "classical statistical techniques [that] were developed between the 18th and early 20th centuries for much smaller data sets than the ones we now have at our disposal". They could have looked back even farther in time: it is not an exaggeration to say that the origins of this kind of "prescriptive" rational reasoning from data facilitated by conceptual and mathematical models can be traced across 4,000 years of the history of science.
Applications of inferential reasoning to modern technologies and problems already motivate much of modern science. To name a few examples: generations of advancement in causal inference enable measurement of the causal effects of salient interventions from uncontrollable observational datasets, new algorithms for Bayesian inference enable expectations to be computed over high-dimensional models that capture the behavior of complex probabilistic systems, and the field of interpretable machine learning has generated elegant mechanisms to explain so-called "black box" models in comprehensible terms. All these methods are already in use across virtually every field of science in one form or another. In agreement with Pyle and San Jose, it is certainly valuable for business executives to move beyond the strategic goal of prediction and to recognize the opportunity for data science to enhance our understanding of data and the systems that generate it.
In this article, I offer an accessible introduction to the duality between inference and prediction in data science intended for practitioners in industry. Using evidence from a textual analysis of research abstracts from technical preprints, I show that there is a growing imbalance such that prediction is increasingly dominant in the marketplace of ideas for data science and draw connections to the circumstance in industry described above. I then illustrate the mutual dependence and complementary importance of inference and prediction using examples from the entertainment industry, focusing on the box office projection and advertising attribution tasks. Finally, I examine some implications of these trends for how organizations conceive of and communicate about data science.