Prediction and Inference in Data Science

2. The Duality of Inference and Prediction

The terms inference and prediction are used widely, and not entirely consistently, across the connected domains of data science, from theoretical statistics to computer science to medicine to entertainment and beyond, as well as in everyday parlance. These variations in conceptualization, terminology, and even mathematical notation make it challenging to communicate clearly about high-level concepts to an audience as diverse as industrial data scientists. I will attempt to tackle this challenge here by appealing to descriptions, examples, illustrations, and clarifying contrasts I have found useful in discussions with colleagues.

Figure 1. A graphical diagram of a simple supervised machine learning model. The observed variables are outlined in blue and the unobserved variables of the model in grey; the green plate represents the dimensionality of the data, n.


2.1.  Definitions and perspectives

I define the terms inference and prediction in practical terms as follows:

Predictions: The outputs emitted by a model of a data generating process in response to a specific configuration of inputs.

Inferences: The information learned about the data generating process through the systematic comparison of predictions from the model to observed data from the data generating process.

To elaborate, consider the straightforward case of a linear regression model. With respect to the concepts of inference and prediction, this example is generally representative of predictive model-based inference and supervised machine learning.

Figure 1 shows the model's essential components, enumerated as follows:

A set of independent variables, x_i, that are observed and provided to the model as input data or "predictors".

A dependent variable, y_i, that is also observed and provided to the model as training examples of output data.

Predictions, \tilde y_i, synthetic output data that are generated by the model and intended to match y_i as well as possible.

A set of inferred parameters, \beta, that serve to transform the inputs into the predictions.


An uncertainty measure, \sigma, that characterizes the magnitude of typical errors in predictions.

The variables x_i and y_i are "observables," while \beta and \sigma are model parameters representing features of the model that cannot be directly observed. The prediction \tilde y_i represents the best fit of the model to the observed variable y_i.
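The components enumerated above can be made concrete with a minimal sketch, assuming ordinary least squares fit to synthetic data. The data, the "true" parameter values, and all variable names (`beta_hat`, `sigma_hat`, `y_tilde`) are illustrative assumptions and are not drawn from any application discussed in this article. The point is only that a single fit yields both products: the inferences (\beta and \sigma) and the predictions (\tilde y_i).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generating process (assumed for illustration only).
n = 500
beta_true = np.array([1.5, -2.0, 0.5])   # intercept and two slopes
sigma_true = 0.3                         # typical size of the noise

# Observables: inputs x_i (with an intercept column) and outputs y_i.
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# Inference: estimate beta by least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction: model outputs for the observed inputs.
y_tilde = X @ beta_hat

# Uncertainty: residual standard deviation, corrected for the
# number of fitted parameters.
residuals = y - y_tilde
sigma_hat = np.sqrt(residuals @ residuals / (n - X.shape[1]))

print("beta_hat :", np.round(beta_hat, 3))
print("sigma_hat:", round(float(sigma_hat), 3))
```

Note the dependence in both directions: `y_tilde` cannot be computed without `beta_hat`, and `sigma_hat` cannot be computed without comparing `y_tilde` to `y`.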

A sample application from the entertainment industry, which will be detailed in §4.1, is the modeling of box office outcomes for new theatrical film releases. In box office projection models, the independent variables are typically characteristics of films such as their cast composition and genre classification, among many others. The dependent variable of interest may be the total box office gross of a film. In the linear regression case, the \beta parameters directly represent the effects of each independent variable on the dependent variable, such as the additional dollars in gross attributable to the selection of a genre favored by audiences. The uncertainty measure indicates the size of the discrepancy expected between the revenue predicted by the model and the actual outcome.

Both inference and prediction are integral to the functioning of the model, and both have effects that cannot be ignored in the practical application of the model. The parameters can only be learned by comparing the predictions to observations of the dependent variable through the model training process. If the model's predictions do not match the observed outcomes of the data generating process reasonably well, inferences about that process will be unreliable. Likewise, the predictions themselves are generated directly from the combination of the parameters and the independent variables. If the operation of the model through its parameters cannot be explained and understood, there will be little basis to build confidence in the model's predictive ability for future realities or in new domains.

However, I have observed that many analysts and organizations choose to invest their time and attention primarily in one function or the other, and perceive the relative importance of inference and prediction as imbalanced. In particular, as discussed in §1, a notion that prevails in many circles of industry is that prediction is the primary concern of data science. Figure 2 contrasts the two perspectives on the example model. Panel a) illustrates the "balanced perspective," where the parameter inferences and predictive outputs are viewed with equal interest. Panel b) highlights the "prediction-oriented" perspective, where the predictive outputs of the model carry outsized interest compared to the other essential elements of the model.

Figure 2. Illustration of perspectives on the sample model of Figure 1: a) a balanced perspective equally valuing inference and prediction and b) a prediction-oriented viewpoint.

Figures 1 and 2 illustrate that prediction and inference are two distinct goals of the modeling process. Both offer value to organizations and the two are inextricably connected, but they can be viewed in different ways. Both perspectives are valid in different contexts, and it is important for analysts and organizations to recognize the appropriate orientation for a particular data science project.

In some operational applications in industry, the predictive outputs of a model will be integrated directly into an automated system and the values of the inferred parameters and other model behaviors will never be inspected; this exemplifies the prediction-oriented perspective. For example, the developer of an online streaming platform implementing a collaborative filtering algorithm may deploy the predictive outputs of their model to provide recommendations. The recommendations are obtained by fitting users' time spent viewing video on the platform, perhaps without concern for the parameters of this model or the drivers of the users' behavior. In this "black box" modeling regime, the parameters are simply a means to an end: nuisances that can be entrusted to a well-engineered automated learning framework and overlooked thereafter. (That said, there are numerous inferential insights that can be extracted about consumer content preferences, and about the content itself, from collaborative filtering algorithms, e.g. Tintarev and Masthoff, 2015.)

Guidotti et al. provide a useful criterion for when truly black box predictive models are appropriate and, therefore, when inference, explanation, or interpretability are unnecessary: "an explanation could be not required if there are no decisions that have to be made on the outcome of the prediction". Of course, in many contexts in science and industry broadly, making decisions on the basis of data is the primary underlying motivation for applying data science.

At the other extreme, the predictive outputs of a model may be used solely as a means for model fitting in order to produce inferences and may be scarcely commented on; this is an "inference-oriented perspective". For example, an astronomer may make careful measurements of the brightness of a supernova explosion for the purpose of inferring the physical parameters of a progenitor star through the comparison of the brightness observations to models motivated by astrophysical theory. In this context, future predictions of the observables are uninteresting in and of themselves. Identifying the physical parameters of the stars is the goal of the study, though these parameters cannot be measured directly; even if it were possible to place a star on a balance to measure its mass, the observation of the supernova itself follows the conflagration of the star. The brightness measurements are a means to an end: incidental observables that serve the purpose of constraining the values of physical stellar parameters through model training validated by predictive performance against those observables.

As I will explore further below, there is evidence that the prediction-oriented perspective is increasingly dominant in many fields and I argue that there would be benefit to more frequent use of the balanced viewpoint.


2.2.  Conceptual parallels

The duality between inference and prediction as defined in this section parallels, but is distinct from, other well-known conceptual dualities that confront data scientists. Here I examine a few related concepts to clarify the distinctions between them.

First, Breiman described a conflict of "culture" in statistical modeling, distinguishing a "data modeling culture" that operates on the assumption that there exists a parameterizable model that can explain the data generating process from an "algorithmic modeling culture" that assumes that "Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior". Breiman asserted that 98% of all statisticians at the time of his writing belonged to the data modeling culture, while the algorithmic modeling culture was already dominant in other fields. He advocated for the use of algorithmic models by exploring the Occam dilemma: "Accuracy generally requires more complex prediction methods". Breiman's debate between modeling cultures, or model types, is not the duality I examine here. Instead, the duality explored in this section corresponds to Breiman's two "goals" for analyzing data, "prediction" (similar to my definition above) and "information" (similar to my definition of inference above), rather than the two "approaches" of data and algorithmic modeling. Both goals can be pursued via either approach. The simultaneous pursuit of Breiman's two goals would be analogous to the balanced perspective advocated in this article. Furthermore, information extraction (Breiman's term) or inference (mine) from a model need not be confined to parameter estimation, as in the example above. Other methods for analysts to extract information from and interpret the modeling process are discussed in §5.1 and elsewhere in this article.

Second, I distinguish my definition of inference from the narrow domain of frequentist hypothesis testing. In some domains, particularly psychology, statistical inference has historically been synonymous with hypothesis testing. In my formulation, hypothesis testing would be one approach among a broad class of methodologies for learning about the data generating process that also includes Bayesian methods, techniques for interpreting deep learning models, and others discussed elsewhere in this article. My definition of "inference" is more similar to the concept of "scientific inference" discussed by e.g. Hubbard, Haig, and Parsa: "discovery of replicable and empirically generalizable findings". In an exploration of the purpose of hypothesis testing, p-values, and significance levels, Billheimer advocates for "Predictive inference", summarized as follows: "Rather than infer the value of a parameter that can never be observed, our inferential focus should be the prediction of future observable quantities". Billheimer's recommendation, building on work by de Finetti, Geisser, and others, is that testable predictions of future observable values should be the currency for evaluating the performance of a model and identifying the reliability of inferences about parameters. This is compatible with the balanced perspective advocated in this section and similar to the concept of "correspondence to observable reality" articulated as a virtue of statistical practice by Gelman and Hennig.

Next, I consider the familiar distinction between correlation and causation. The rich literature on causal inference carefully defines the meaning of the causal effect of an intervention assigned to a unit and establishes multiple frameworks and a variety of empirical methods for measuring causal effects from observational and experimental data. In both business and research settings, constraints on the ability to control assignment mechanisms and other system factors often limit the extent to which causal effects can be isolated and measured. As a result, it is often necessary for data scientists to, for example, analyze descriptive correlations within datasets or to model data with known (and unknown) confounding variables that may not be fully observed. For analyses focused on the goal of prediction, data scientists must recognize how these limitations affect the generalizability of their models. A predictive model that learns a correlation between a particular predictor and a dependent variable of interest may perform poorly on out-of-sample cases where an unobserved confounding variable or an additional cause has changed. For analyses focused on the goal of inference, it is critical to understand the limitations of a dataset or study design for identifying causation to avoid over-interpreting inferences.
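A small simulation can illustrate the generalization risk described above. This is a sketch under stated assumptions, not a general result: a hypothetical unobserved confounder `z` drives both the predictor `x` and the outcome `y`, while `x` itself has no causal effect on `y`. A regression of `y` on `x` fits well in the training regime, but its error grows sharply once `x` is set independently of `z`. All names and numerical values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

def outcome(z):
    # y depends only on the confounder z, never on x (assumed).
    return 2.0 * z + rng.normal(scale=0.5, size=z.size)

# Training regime: x tracks the unobserved confounder z.
z_train = rng.normal(size=n)
x_train = z_train + rng.normal(scale=0.5, size=n)
y_train = outcome(z_train)

# The analyst observes only x and y and fits y ~ intercept + slope * x.
slope, intercept = np.polyfit(x_train, y_train, 1)

def mse(x, y):
    # Mean squared error of the fitted line on the given data.
    return float(np.mean((y - (intercept + slope * x)) ** 2))

# Shifted regime: x is now set independently of z (for example, by an
# intervention), with the same marginal variance as in training.
z_shift = rng.normal(size=n)
x_shift = rng.normal(scale=np.sqrt(1.25), size=n)
y_shift = outcome(z_shift)

mse_train = mse(x_train, y_train)
mse_shift = mse(x_shift, y_shift)

print(f"learned slope        : {slope:.2f}")
print(f"MSE, training regime : {mse_train:.2f}")
print(f"MSE, shifted regime  : {mse_shift:.2f}")
```

The learned slope is clearly nonzero even though `x` has no causal effect on `y`, so it would be a misleading inference if read causally, and it is also the reason the predictions degrade in the shifted regime.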

Finally, inferences under any particular model (however simple or complex) are subject to the assumption that the model accurately describes the data generating process. Amrhein, Trafimow, and Greenland suggest treating inferential statistics as "unstable local descriptions of relations between models and the obtained data". Analysts should fit a variety of models, systematically compare their performance, and generalize when possible using continuous model expansion to help mitigate the effects of this localization. Ultimately, these considerations explain how inference is a useful procedure for analysts and organizations to learn from data. The iterative process of designing models, applying them to data, checking their predictive performance, and interpreting the models' parameters and behavior promotes understanding of the data generating system being modeled.
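In its simplest form, the recommendation to fit a variety of models and compare them systematically might look like the following. This is a sketch only, using synthetic data, polynomial candidates of increasing degree, and hypothetical names (`rmse`, `coefs`). The candidates form a discrete nested family, which is a cruder device than continuous model expansion, but the loop of fitting, checking held-out predictions, and then reading the parameters is the same.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a quadratic process (assumed for illustration).
n = 400
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.3, size=n)

# Simple split: first 300 points for fitting, last 100 held out.
train, test = slice(0, 300), slice(300, n)

rmse = {}
coefs = {}
for degree in (1, 2, 3):
    # np.polyfit returns coefficients ordered from highest power down.
    coefs[degree] = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coefs[degree], x[test])
    rmse[degree] = float(np.sqrt(np.mean((y[test] - pred) ** 2)))

for degree in rmse:
    print(f"degree {degree}: held-out RMSE = {rmse[degree]:.3f}")

# Inference is only read off a model whose predictions hold up.
print("quadratic coefficients:", np.round(coefs[2], 2))
```

The held-out error should drop sharply from degree 1 to degree 2 and change little at degree 3, which supports interpreting the degree-2 coefficients while also showing that the linear model's slope would have been an unstable, model-dependent description.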