5. Discussion
In the previous sections, I have described prediction as the output from models for data and inference as a mechanism by which people can learn from the comparison of models to data. I have observed that there is a prevailing focus on prediction for data science in industry and shown that there is a growing focus on prediction in the research literature, but have offered examples of the importance of a balanced perspective incorporating both inference and prediction in applied work in the entertainment industry.
In this section, I will assess some of the possible implications of the trend towards prediction-oriented discourse among data scientists. In particular, because the success of data scientists in communicating about the workings and results of models is critical to the ability of an organization to learn from its data, I will explore the role that communication plays in the actual work of and outcomes from data science.
5.1. Implications of the Trend Towards Prediction
Arguably, the distinctions are purely semantic and modelers working under either the predictive or balanced perspective can choose to address the full range of modeling challenges ascribed to inference and prediction without regard for these terms. Even if so, semantic distinctions can have real consequences within organizations.
The terms we as researchers use and emphasize in educating students, communicating to the public, developing strategy with and conveying results to executives, and talking to investors also impact the paths we follow in doing data science. Data scientists and other business stakeholders are collaborators in the definition of any modeling task. When data scientists present and justify their modeling work purely in terms of predictive performance, it tends to be evaluated on that basis. When a studio embarks on box office "projection," it implicitly frames the market modeling problem around prediction outputs rather than inference and information extraction, regardless of the actual use cases for the model. If the industry nomenclature referred to box office "contributor," "driver," or "attribution" analysis, the inferential goals of this modeling task may be more prominent and perhaps the literature would more frequently discuss causal inference and analysis of uncertainty. This co-dependence emphasizes the imperative for data scientists to communicate thoughtfully, clearly, and effectively within organizations to ensure that their modeling work is aligned to business objectives.
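The framing difference between a box office "projection" and an "attribution" analysis can be made concrete with a toy model. The sketch below is purely illustrative (the data, feature names, and coefficients are hypothetical and not drawn from any real studio analysis): it fits a single ordinary least squares model but reports it two ways, once as a point prediction for a new release, and once as coefficient estimates with uncertainty that speak to the drivers of the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical film-level features: marketing spend ($M) and screen count (000s).
n = 200
X = np.column_stack([
    np.ones(n),               # intercept
    rng.gamma(2.0, 15.0, n),  # marketing spend
    rng.gamma(3.0, 1.0, n),   # screens
])
beta_true = np.array([5.0, 0.8, 6.0])
y = X @ beta_true + rng.normal(0.0, 10.0, n)  # opening box office ($M)

# One model, fit once by ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

# "Projection" framing: a single number for a new release.
x_new = np.array([1.0, 40.0, 3.5])
print(f"Projected opening: ${x_new @ beta_hat:.1f}M")

# "Attribution" framing: the same fit, read as drivers with uncertainty.
for name, b, s in zip(["baseline", "per $1M marketing", "per 1k screens"],
                      beta_hat, se):
    print(f"{name}: {b:.2f} +/- {1.96 * s:.2f}")
```

The point is that the model is identical in both readings; only the nomenclature, and hence which outputs stakeholders attend to, differs.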
While the rate analysis of term usage from the quantitative academic literature is not a causal temporal analysis, it is not difficult to imagine that fields of application that have historically focused on statistical inference may be slower to recognize and adopt the benefits of new techniques driven by the predictive modeling community, such as advances in deep learning, and vice versa. Companies can mitigate the effects of this siloing by constructing data science teams with diverse representation of prior methodological and applied experience, encouraging interaction across functional and disciplinary divisions, promoting external collaborations and participation in conferences, and setting an expectation for reading journals and other sources from a variety of disciplines, including trans-disciplinary venues.
Within the research community, the recent investment in predictive methodologies suggests an opportunity to further capitalize on inferential techniques that complement these advancements. Exciting new work on visualizing and interpreting the activations of deep neural networks, new mechanisms for understanding the workings of machine learning models, and approaches for probabilistic deep learning are just a few exemplars of the opportunity that exists at the intersection of predictive and inferential approaches.
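One simple instance of this intersection is permutation importance, a model-agnostic way to interrogate which inputs a predictive model actually relies upon. The sketch below is illustrative rather than drawn from the works referenced above: it uses synthetic data, and a least squares fit stands in for an arbitrary black-box predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y depends on x0 strongly, x1 weakly, x2 not at all.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.5, n)

# Any fitted predictor works; here a least squares fit stands in
# for a black-box model.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda A: A @ w

def permutation_importance(X, y, predict, n_repeats=20, seed=2):
    """Mean increase in squared error when each feature is shuffled."""
    perm_rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = perm_rng.permutation(Xp[:, j])
            scores[j] += np.mean((y - predict(Xp)) ** 2) - base_mse
    return scores / n_repeats

imp = permutation_importance(X, y, predict)
print("importance:", np.round(imp, 3))  # x0 >> x1 > x2 (near zero)
```

The predictive model is left untouched; the inferential question of which inputs matter is answered by probing it, which is exactly the kind of hybrid workflow the intersection described above enables.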
5.2. Data Science as a Language for Research
Communicating about any complex technical subject is challenging. Yet communication in industrial data science is further compounded by the diversity of audiences that may be stakeholders for any given model. Within a company, there may be other data scientists focused on similar problems, data scientists working with entirely different modalities of data, software engineers, creative experts (such as marketers or product designers), operations managers, executive decision makers, and more, all of whom need to interpret and act upon the results of the same model. While there is certainly nothing new about the need for disparate actors across an organization to be coordinated with each other, the mutual agreement that they should coordinate on the basis of models of data, and data science generally, is perhaps the fundamental consequence of the "big data revolution".
In this way, data science itself can be viewed as a new common basis for communication, inquiry, and understanding in both science and industry. For example, within business, data-driven strategy development may be formulated as a decision theoretic response to statistical inferences. Operational execution in the age of automation already routinely takes the form of a predictive modeling task. In this sense, data science may manifest a new common language shared by researchers across sectors and domains.
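The claim that strategy can be formulated as a decision-theoretic response to statistical inferences can be sketched in a few lines. In the toy example below, a hypothetical retailer chooses an order quantity by maximizing expected profit over posterior draws of demand; the Gamma posterior, the prices, and the newsvendor-style payoff are all illustrative assumptions, not content from this article.

```python
import numpy as np

rng = np.random.default_rng(3)

# Inference step: posterior uncertainty about demand, summarized here
# by samples from a hypothetical Gamma posterior (mean ~100 units).
demand_draws = rng.gamma(shape=50.0, scale=2.0, size=10_000)

# Decision-theoretic step: choose the order quantity that maximizes
# expected profit under that posterior (newsvendor-style payoff).
price, cost = 12.0, 5.0

def expected_profit(q, demand):
    sales = np.minimum(q, demand)           # can only sell what is demanded
    return np.mean(price * sales - cost * q)

candidates = np.arange(50, 151)
profits = np.array([expected_profit(q, demand_draws) for q in candidates])
best_q = candidates[np.argmax(profits)]
print(f"order {best_q} units; expected profit ${profits.max():.0f}")
```

The inferential output (the posterior) and the operational decision (the order quantity) are linked explicitly, which is the shared grammar the paragraph above describes.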
If so, part of our common work as practitioners in the field of data science is to create and standardize the very terms of discussion for research and decision making within businesses going forward. The choices we make as a community today in how to describe our purpose and our work may reverberate for decades by framing discussions not only in classrooms and journals, but also in laboratories, offices, and board rooms. Consider the phrase "big data". It has the virtue of signaling part of what is exciting about this new field (the ability to manipulate, apply analytical methods to, and extract value from data on a scale once unimaginable) but has the vice of neglecting other aspects important to data science. It devalues the significance and complexity of work done on not-"big" datasets, which comprises much of the cutting edge work in academia and industry; it fails to invoke the coequal role of modeling in data science; and it ignores the critically important issue of data quality.
In general, there is room for data scientists to identify alternative language that communicates their meaning more clearly and directly to diverse audiences. Sometimes this can be accomplished by a translation, e.g., replacing a specific term like "singular value decomposition" with a generic term like "recommender system". Other times, it may require deliberate explication of a confusing or misunderstood concept. For example, with respect to the communication of uncertainty in sensitive areas of public interest, Morton, Rabinovich, Marshall, and Bretschneider recommended emphasizing the actionable potential of the upside of the risk profile of climate change, and Manski commented on the risks of not reporting uncertainty in the publication of economic data by government agencies. Both provide recommendations for framing estimates of uncertainty as specific and useful assessments of the variability or risk associated with a system that should have productive consequences on decisions made in response to an analysis. The optimal approach to communicating any important topic with the potential for ambiguity will vary by subject and audience and deserves the thought and consideration of the data scientist as a domain expert.
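In that spirit, even a basic interval estimate can be reported as an actionable range rather than a bare point. The sketch below uses synthetic data and one hypothetical phrasing template (it is not a recommendation from Manski or from Morton et al.): it computes a bootstrap interval for a mean and renders it in decision-oriented language alongside the point estimate.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic weekly revenue-lift observations from a hypothetical campaign ($k).
lift = rng.normal(20.0, 8.0, size=60)

# Bootstrap the sampling distribution of the mean lift.
boot_means = np.array([
    rng.choice(lift, size=lift.size, replace=True).mean()
    for _ in range(5_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

# A point estimate alone invites overconfidence ...
print(f"Estimated mean lift: ${lift.mean():.1f}k/week")
# ... whereas a framed interval supports a decision.
print(f"The campaign's mean lift is likely between ${lo:.1f}k and "
      f"${hi:.1f}k per week (95% bootstrap interval).")
```

The second message costs two extra lines of code yet gives a stakeholder the variability assessment both sets of authors above argue should accompany any estimate.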
Simplicity and concision are virtues in technical communication, but they should be used to elegantly explain difficult concepts and not to obscure or avoid them. When it comes to topics like uncertainty, models that may seem like black boxes, and inference about unobservable parameters, data scientists should strive to communicate more about modeling within organizations, not less. Likewise, Manski expressed the hope that "concealment of uncertainty is a modifiable social norm" addressable by increased awareness. It is incumbent upon the data scientist to identify the most effective ways to provide context for and to explain why their model acts the way it does, why they believe its implications, how they have made relevant methodological choices and assumptions, and what caveats remain in their implementation or analysis. Not every facet of a model needs to be belabored, but the ones that are most important to the data scientist will generally have significance for the audience as well. Communication should be viewed as an integral part of, and an outcome from, a balanced model development process.