Prediction and Inference in Data Science

3. The Rise of Prediction in the Literature

As indicated in §1, the imbalanced prediction-oriented perspective on data science has become dominant in forums ranging from the executive suites of many industries to the online community of data science learners and practitioners. This aligns with trends in the academic research community, which has itself increasingly focused on the predictive aspect of modeling.

In this section, I demonstrate this trend by textual analysis of academic preprints. While the linguistic expression of the broad concepts of inference and prediction explored in §2 eludes strict and comprehensive quantification, a simple analysis of word frequency signals the relative rates at which researchers deploy the inferential and predictive frames.

Table 1. Synonymous terms for salient root words used in Figures 3-4.

Root Word	Synonyms
Predict	predict, predictability, predicted, predicting, prediction, predictions, predictive, predictor, predictors, predicts
Infer	infer, inference, inferences, inferencing, inferential, inferred, inferring, infers
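As an illustration of how the term lists in Table 1 can be applied to a single abstract, the following is a minimal sketch. The whole-word, case-insensitive matching rule and the names `TERMS`, `PATTERNS`, and `concepts_in` are assumptions made for this example and are not taken from the original analysis.

```python
import re

# Term lists copied from Table 1.
TERMS = {
    "predict": ["predict", "predictability", "predicted", "predicting",
                "prediction", "predictions", "predictive", "predictor",
                "predictors", "predicts"],
    "infer": ["infer", "inference", "inferences", "inferencing",
              "inferential", "inferred", "inferring", "infers"],
}

# Assumption: a term counts only as a whole word, ignoring case.
PATTERNS = {
    root: re.compile(r"\b(?:" + "|".join(words) + r")\b", re.IGNORECASE)
    for root, words in TERMS.items()
}

def concepts_in(abstract):
    """Return the set of root words whose term list occurs in the abstract."""
    return {root for root, pattern in PATTERNS.items() if pattern.search(abstract)}

# An abstract that uses both vocabularies is reported under both roots.
print(concepts_in("We predict galaxy masses and infer halo properties."))
```

Under this matching rule, a word such as "unpredictable" does not count, since it neither starts at a word boundary nor appears in Table 1.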

To supply data for this analysis, I construct a textual database of research abstracts queried from the scholarly preprint server the arXiv using its API. The arXiv is divided into domain-specific categories, and each category is further divided into subcategories such as astro-ph.GA ("Astrophysics of Galaxies") and cs.NE ("Computer Science: Neural and Evolutionary Computing"). The date of first publication and the rate of publication vary greatly across subcategories, with some having received thousands of submissions per year for decades and others being effectively quiescent. I gathered all abstracts published from 2005-01-01 to 2018-12-31 across the 141 subcategories of the following arXiv categories, totaling more than 1.3 million abstracts: astro-ph (Astrophysics), cond-mat (Condensed Matter), cs (Computer Science), econ (Economics), eess (Electrical Engineering and Systems Science), math (Mathematics), nlin (Nonlinear Sciences), physics (Physics), q-bio (Quantitative Biology), q-fin (Quantitative Finance), and stat (Statistics). Not included are several additional categories in the physics domain, such as nucl-th (Nuclear Theory) and quant-ph (Quantum Physics), which are anticipated to display trends similar to those of the primary physics category.
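For readers who wish to assemble a comparable dataset, the sketch below shows one way to build a paged request URL for a single subcategory against the public arXiv API query endpoint. It only constructs the URL and makes no network request. The helper name `build_query_url`, the page size, and the sort order are assumptions for illustration; the original analysis may have queried the API differently, and filtering to the 2005 to 2018 window is assumed to happen after retrieval.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public arXiv API query endpoint.
ARXIV_API = "http://export.arxiv.org/api/query"

def build_query_url(subcategory, start=0, max_results=100):
    """Build a paged arXiv API URL for one subcategory (hypothetical helper)."""
    params = {
        "search_query": f"cat:{subcategory}",  # restrict to one subcategory
        "start": start,                        # offset of the first result
        "max_results": max_results,            # page size (assumed value)
        "sortBy": "submittedDate",
        "sortOrder": "ascending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

url = build_query_url("stat.ML", start=200)
# Decode the query string again to confirm what would be sent.
print(parse_qs(urlparse(url).query)["search_query"])
```

Stepping `start` in increments of `max_results` pages through a subcategory; the response is an Atom feed whose entries carry the abstract text and submission date.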

With this data, we can examine the relative frequency and time evolution of the use of "infer," "predict," and related terms in arXiv categories and subcategories. I adopt the following notation. The total count of abstracts from a subcategory, $s$, in a given year, $y$, is denoted $N_{s, y}$. The concepts, $C$, of inference and prediction are identified by the words "infer," "predict," and other semantically synonymous terms listed in Table 1. The presence of these terms is counted within abstracts from each subcategory and aggregated by year as $N(C)_{s, y}$. If abstracts include words from both the "infer" and "predict" lists of Table 1, they are counted in both categories. The growth of abstract submissions to a subcategory between years $a$ and $b$ is calculated simply as $G(N)_{s, a, b} = N_{s, a} / N_{s, b}$. The rate of use of a term $\alpha$ in a subcategory in a given year $a$ is expressed as $U_{s, \alpha, a} = N(\alpha)_{s, a} / N_{s, a}$. The growth rate of the usage of the same term between years $a$ and $b$ in a subcategory is calculated as $G(U)_{s, \alpha, a, b} = U_{s, \alpha, a} / U_{s, \alpha, b}$. The relative usage growth rate between two terms $\alpha$ and $\beta$ between years $a$ and $b$ in a subcategory is calculated as $R_{s, \alpha, \beta, a, b} = G(U)_{s, \alpha, a, b} / G(U)_{s, \beta, a, b}$.
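The ratios defined above can be traced with a small worked example. The sketch below uses invented counts for a hypothetical subcategory `stat.XX`; the numbers are not drawn from the arXiv data, and the function names are illustrative. It reads $a$ as the later year and $b$ as the earlier one, so that values above 1 indicate growth.

```python
# Invented counts for a hypothetical subcategory (not real arXiv data).
N = {("stat.XX", 2012): 200, ("stat.XX", 2018): 800}   # N_{s,y}
N_C = {                                                # N(C)_{s,y}
    ("stat.XX", "predict", 2012): 40, ("stat.XX", "predict", 2018): 240,
    ("stat.XX", "infer", 2012): 50,   ("stat.XX", "infer", 2018): 160,
}

def growth_n(s, a, b):
    """G(N)_{s,a,b}: ratio of abstract counts in year a to year b."""
    return N[(s, a)] / N[(s, b)]

def usage(s, term, y):
    """U_{s,term,y}: fraction of abstracts in year y that use the term set."""
    return N_C[(s, term, y)] / N[(s, y)]

def growth_u(s, term, a, b):
    """G(U)_{s,term,a,b}: ratio of usage rates in year a to year b."""
    return usage(s, term, a) / usage(s, term, b)

def relative_growth(s, alpha, beta, a, b):
    """R_{s,alpha,beta,a,b}: usage growth of alpha relative to beta."""
    return growth_u(s, alpha, a, b) / growth_u(s, beta, a, b)

g_n = growth_n("stat.XX", 2018, 2012)                        # 800 / 200
r = relative_growth("stat.XX", "predict", "infer", 2018, 2012)
print(g_n, r)
```

In this toy case submissions grow fourfold, "predict" usage rises from 20% to 30% of abstracts, "infer" usage falls from 25% to 20%, and $R$ is therefore about $1.5 / 0.8 = 1.875$ in favor of prediction.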

Although this word frequency analysis is a blunt measurement instrument, its results readily suggest some interesting changes over time.

Figure 3 illustrates these results for some popular arXiv subcategories. The fast-growing stat.ML ("Statistics: Machine Learning") subcategory, which has grown by ~8× in annual submissions since 2012, is illustrative of the topical evolution of many subcategories. In stat.ML, inference dominated the discussion during the early period of the subcategory in the late 2000s. But the subcategory has trended quickly towards prediction over the past half-decade, with the use of prediction synonyms rising by ~25% since 2015 and inference synonyms falling by ~35% over the same period. Similarly, the cs.CV ("Computer Science: Computer Vision and Pattern Recognition") subcategory has grown by >10× since 2012. Prior to that time, the use of inferential terms was somewhat more common than the use of prediction terms. Since that time, the use of both terminologies has grown in this field, but the use of predictive terms has grown at ~2.5× the rate of inferential terms and now exceeds their use by more than ~2×.

One of the more volatile cases is cs.AI ("Computer Science: Artificial Intelligence"). Before 2010, usage of predictive terms was stable while usage of inferential terms was increasing. Between 2010 and 2013, this field saw a dramatic rise in inference-related terms (more than 2× over 2 years) while predictive term usage continued at similar rates. Since 2013, prediction terms have grown by >2× while inferential terms have fallen by half. The abstract submission rate has roughly tripled over this time period.

A final interesting direct contrast is that between the subcategories stat.AP ("Statistics: Applications") and stat.TH ("Statistics Theory"). stat.TH had even usage of inferential and predictive language in 2012. While the fraction of articles mentioning prediction has stayed constant since that time and total article submissions have risen a modest ~60%, the discussion of inference has grown ~50%, a trend noticeably contrary to that of the fast-growing subcategories discussed previously. In contrast, stat.AP has seen ~70% growth in prediction discussion since 2012, with a flat long-term trend in the use of "infer" and its synonyms. These figures suggest that discussion of statistical inference is increasingly concentrated in certain subcategories like stat.TH and moving away from subcategories where it used to be more prevalent, like stat.ML, cs.CV, cs.AI, and stat.AP.

Figure 3. Average rate of usage (U) of "infer," "predict," and synonymous terms (as defined in Table 1) within abstracts of different arXiv subcategories over time (colored lines) and standard error (colored regions). The total volume of abstracts (N) in each subcategory is also shown (grey dashed line).

The faster-growing research disciplines tend to be the same ones that have increasingly focused on prediction in recent years. Figure 4 illustrates this finding at the subcategory level. Each time point is calculated using a rolling boxcar mean with a 3-year trailing window, i.e., the rate at each time point is estimated by averaging the sequential yearly aggregations over the prior 3-year period. Subcategory data points with fewer than 100 articles in a year or extremely low (< 1%) usage of either term set are excluded. Evaluated as a simple correlation at the subcategory level weighted by 2018 post volume, the trend between article submission growth and the relative usage growth for "prediction" over "inference" is fairly strong (weighted correlation coefficient $\rho = 0.59$).
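The two computations named in this paragraph, a trailing boxcar mean and a weighted correlation coefficient, can be sketched as follows. This is an illustrative implementation under stated assumptions, not the code used for Figure 4: the window is assumed to cover the current year and the two years before it, the correlation is assumed to be a weighted Pearson coefficient, and the function names are hypothetical.

```python
import math

def trailing_boxcar_mean(values, window=3):
    """Unweighted mean over each point and the (window - 1) points before it.

    Positions without a full window are returned as None.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def weighted_corr(x, y, w):
    """Weighted Pearson correlation of x and y with non-negative weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    return cov / math.sqrt(var_x * var_y)

# Toy yearly usage rates: the first two years lack a full 3-year window.
smoothed = trailing_boxcar_mean([1.0, 2.0, 3.0, 4.0])
print(smoothed)
```

In the setting of Figure 4, `x` would hold $G(N)$ per subcategory, `y` the relative usage growth $R$, and `w` the 2018 submission volume, so that large subcategories influence the coefficient more than small ones.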

Figure 4. Comparison between article submission growth and the growth in the relative usage of inference and prediction terms for arXiv subcategories. The growth is calculated over the period 2012 to 2018 using a rolling boxcar mean. The shading denotes the total submission volume as of 2018. The blue line shows a least squares linear trend weighted by 2018 post volume. For clarity, only the largest (by 2018 post volume) third of the points and those with extreme growth rates are labeled.

The relation between article submission growth, $G(N)$, and increasing predictive focus, $R$, is also evident at the category level (Table 2). Among the categories studied here, the three fastest-expanding categories are also the three with the strongest trend towards predictive language usage: cs, stat, and q-fin. The stable-volume astro-ph, nlin, and cond-mat categories, in contrast, are trending towards more inferential language usage.

These results show that scholarly articles in the fastest-growing quantitative research fields, as proxied by abstracts posted to the arXiv, have increasingly focused on prediction since about 2012. It is reasonable to conclude that this shift in focus has contributed to the prevailing perspective in industry discussed in §1, where data science emerged as a prominent area of investment over roughly the same time period. Because techniques developed in one domain are studied or applied in the other, the rise of prediction-focused work in the academic literature may also have been driven in part by demands originating in industrial practice.

Table 2. Summary of the arXiv growth rate statistics displayed in Figure 4 aggregated by category. The growth is calculated over the period 2012 to 2018 using a rolling boxcar mean; see the text for details and definition of terms.

	Relative 'predict' usage growth: R	Article submission growth: G(N)	2018 article volume: N (1000’s)
nlin	0.45	1.01	1.81
astro-ph	0.83	1.15	19.97
cond-mat	0.96	1.18	23.57
math	0.98	1.34	44.69
q-bio	1.22	1.43	3.02
physics	0.98	1.61	19.97
q-fin	1.58	1.66	1.30
cs	1.30	2.98	71.19
stat	1.46	3.52	13.43
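The category-level claim in the text, that the three fastest-growing categories are also the three with the largest $R$, can be checked directly against the values in Table 2. The short sketch below does so; the dictionary layout and the helper name `top_k` are choices made for this example.

```python
# Values copied from Table 2: (R, G(N), 2018 article volume in thousands).
TABLE2 = {
    "nlin":     (0.45, 1.01, 1.81),
    "astro-ph": (0.83, 1.15, 19.97),
    "cond-mat": (0.96, 1.18, 23.57),
    "math":     (0.98, 1.34, 44.69),
    "q-bio":    (1.22, 1.43, 3.02),
    "physics":  (0.98, 1.61, 19.97),
    "q-fin":    (1.58, 1.66, 1.30),
    "cs":       (1.30, 2.98, 71.19),
    "stat":     (1.46, 3.52, 13.43),
}

def top_k(column, k=3):
    """Return the k categories with the largest value in the given column."""
    ranked = sorted(TABLE2, key=lambda cat: TABLE2[cat][column], reverse=True)
    return ranked[:k]

most_predictive = top_k(0)   # ranked by R
fastest_growing = top_k(1)   # ranked by G(N)
print(most_predictive, fastest_growing)
print(set(most_predictive) == set(fastest_growing))
```

The two rankings contain the same three categories (cs, stat, and q-fin), although in a different order.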