5. Discussion
In the previous sections, I have described prediction as the output from models for data and inference as a mechanism by which people can learn from the comparison of models to data. I have observed that there is a prevailing focus on prediction for data science in industry and shown that there is a growing focus on prediction in the research literature, but have offered examples of the importance of a balanced perspective incorporating both inference and prediction in applied work in the entertainment industry.
In this section, I will assess some of the possible implications of the trend towards prediction-oriented discourse among data scientists. In particular, because the success of data scientists in communicating about the workings and results of models is critical to the ability of an organization to learn from its data, I will explore the role that communication plays in the actual work of and outcomes from data science.
5.1. Implications of the Trend Towards Prediction
Arguably, the distinctions are purely semantic and modelers working under either the predictive or balanced perspective can choose to address the full range of modeling challenges ascribed to inference and prediction without regard for these terms. Even if so, semantic distinctions can have real consequences within organizations.
The terms we as researchers use and emphasize in educating students, communicating to the public, developing strategy with and conveying results to executives, and talking to investors also impact the paths we follow in doing data science. Data scientists and other business stakeholders are collaborators in the definition of any modeling task. When data scientists present and justify their modeling work purely in terms of predictive performance, it tends to be evaluated on that basis. When a studio embarks on box office "projection," it implicitly frames the market modeling problem around prediction outputs rather than inference and information extraction, regardless of the actual use cases for the model. If the industry nomenclature referred to box office "contributor," "driver," or "attribution" analysis, the inferential goals of this modeling task may be more prominent and perhaps the literature would more frequently discuss causal inference and analysis of uncertainty. This co-dependence emphasizes the imperative for data scientists to communicate thoughtfully, clearly, and effectively within organizations to ensure that their modeling work is aligned to business objectives.
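The framing difference between a box office "projection" and an "attribution" analysis can be made concrete with a toy model. The sketch below is purely illustrative (the data, feature names, and coefficients are hypothetical and not drawn from any real studio analysis): it fits a single ordinary least squares model but reports it two ways, once as a point prediction for a new release, and once as coefficient estimates with uncertainty that speak to the drivers of the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical film-level features: marketing spend ($M) and screen count (000s).
n = 200
X = np.column_stack([
    np.ones(n),               # intercept
    rng.gamma(2.0, 15.0, n),  # marketing spend
    rng.gamma(3.0, 1.0, n),   # screens
])
beta_true = np.array([5.0, 0.8, 6.0])
y = X @ beta_true + rng.normal(0.0, 10.0, n)  # opening box office ($M)

# One model, fit once by ordinary least squares.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(cov))

# "Projection" framing: a single number for a new release.
x_new = np.array([1.0, 40.0, 3.5])
print(f"Projected opening: ${x_new @ beta_hat:.1f}M")

# "Attribution" framing: the same fit, read as drivers with uncertainty.
for name, b, s in zip(["baseline", "per $1M marketing", "per 1k screens"],
                      beta_hat, se):
    print(f"{name}: {b:.2f} +/- {1.96 * s:.2f}")
```

The point is that the model is identical in both readings; only the nomenclature, and hence which outputs stakeholders attend to, differs.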
While the rate analysis of term usage from the quantitative academic literature is not a causal temporal analysis, it is not difficult to imagine that fields of application that have historically focused on statistical inference may be slower to recognize and adopt the benefits of new techniques driven by the predictive modeling community, such as advances in deep learning, and vice versa. Companies can mitigate the effects of this siloing by constructing data science teams with diverse representation of prior methodological and applied experience, encouraging interaction across functional and disciplinary divisions, promoting external collaborations and participation in conferences, and setting an expectation for reading journals and other sources from a variety of disciplines, including trans-disciplinary venues.
Within the research community, the recent investment in predictive methodologies suggests an opportunity to further capitalize on inferential techniques that complement these advancements. Exciting new work on visualizing and interpreting the activations of deep neural networks, new mechanisms for understanding the workings of machine learning models, and approaches for probabilistic deep learning are just a few exemplars of the opportunity that exists at the intersection of predictive and inferential approaches.
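One simple instance of this intersection is permutation importance, a model-agnostic way to interrogate which inputs a predictive model actually relies upon. The sketch below is illustrative rather than drawn from the works referenced above: it uses synthetic data, and a least squares fit stands in for an arbitrary black-box predictor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y depends on x0 strongly, x1 weakly, x2 not at all.
n = 500
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.5, n)

# Any fitted predictor works; here a least squares fit stands in
# for a black-box model.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
predict = lambda A: A @ w

def permutation_importance(X, y, predict, n_repeats=20, seed=2):
    """Mean increase in squared error when each feature is shuffled."""
    perm_rng = np.random.default_rng(seed)
    base_mse = np.mean((y - predict(X)) ** 2)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = perm_rng.permutation(Xp[:, j])
            scores[j] += np.mean((y - predict(Xp)) ** 2) - base_mse
    return scores / n_repeats

imp = permutation_importance(X, y, predict)
print("importance:", np.round(imp, 3))  # x0 >> x1 > x2 (near zero)
```

The predictive model is left untouched; the inferential question of which inputs matter is answered by probing it, which is exactly the kind of hybrid workflow the intersection described above enables.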
5.2. Data Science as a Language for Research
Communicating about any complex technical subject is challenging. Yet communication in industrial data science is further compounded by the diversity of audiences that may be stakeholders for any given model. Within a company, there may be other data scientists focused on similar problems, data scientists working with entirely different modalities of data, software engineers, creative experts (such as marketers or product designers), operations managers, executive decision makers, and more, all of whom need to interpret and act upon the results of the same model. While there is certainly nothing new about the need for disparate actors across an organization to be coordinated with each other, the mutual agreement that they should coordinate on the basis of models of data, and data science generally, is perhaps the fundamental consequence of the "big data revolution".
In this way, data science itself can be viewed as a new common basis for communication, inquiry, and understanding in both science and industry. For example, within business, data-driven strategy development may be formulated as a decision theoretic response to statistical inferences. Operational execution in the age of automation already routinely takes the form of a predictive modeling task. In this sense, data science may manifest a new common language shared by researchers across sectors and domains.
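The claim that strategy can be formulated as a decision-theoretic response to statistical inferences can be sketched in a few lines. In the toy example below, a hypothetical retailer chooses an order quantity by maximizing expected profit over posterior draws of demand; the Gamma posterior, the prices, and the newsvendor-style payoff are all illustrative assumptions, not content from this article.

```python
import numpy as np

rng = np.random.default_rng(3)

# Inference step: posterior uncertainty about demand, summarized here
# by samples from a hypothetical Gamma posterior (mean ~100 units).
demand_draws = rng.gamma(shape=50.0, scale=2.0, size=10_000)

# Decision-theoretic step: choose the order quantity that maximizes
# expected profit under that posterior (newsvendor-style payoff).
price, cost = 12.0, 5.0

def expected_profit(q, demand):
    sales = np.minimum(q, demand)           # can only sell what is demanded
    return np.mean(price * sales - cost * q)

candidates = np.arange(50, 151)
profits = np.array([expected_profit(q, demand_draws) for q in candidates])
best_q = candidates[np.argmax(profits)]
print(f"order {best_q} units; expected profit ${profits.max():.0f}")
```

The inferential output (the posterior) and the operational decision (the order quantity) are linked explicitly, which is the shared grammar the paragraph above describes.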
If so, part of our common work as practitioners in the field of data science is to create and standardize the very terms of discussion for research and decision making within businesses going forward. The choices we make as a community today in how to describe our purpose and our work may reverberate for decades by framing discussions not only in classrooms and journals, but also in laboratories, offices, and board rooms. Consider the phrase "big data". It has the virtue of signaling part of what is exciting about this new field (the ability to manipulate, apply analytical methods to, and extract value from data on a scale once unimaginable) but has the vice of neglecting other aspects important to data science. It devalues the significance and complexity of work done on not-"big" datasets, which comprises much of the cutting edge work in academia and industry; it fails to invoke the coequal role of modeling in data science; and it ignores the critically important issue of data quality.
In general, there is room for data scientists to identify alternative language that communicates their meaning more clearly and directly to diverse audiences. Sometimes this can be accomplished by a translation, e.g., replacing a specific term like "singular value decomposition" with a generic term like "recommender system". Other times, it may require deliberate explication of a confusing or misunderstood concept. For example, with respect to the communication of uncertainty in sensitive areas of public interest, Morton, Rabinovich, Marshall, and Bretschneider recommended emphasizing the actionable potential of the upside of the risk profile of climate change, and Manski commented on the risks of not reporting uncertainty in the publication of economic data by government agencies. Both provide recommendations for framing estimates of uncertainty as specific and useful assessments of the variability or risk associated with a system that should have productive consequences on decisions made in response to an analysis. The optimal approach to communicating any important topic with the potential for ambiguity will vary by subject and audience and deserves the thought and consideration of the data scientist as a domain expert.
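In that spirit, even a basic interval estimate can be reported as an actionable range rather than a bare point. The sketch below uses synthetic data and one hypothetical phrasing template (it is not a recommendation from Manski or from Morton et al.): it computes a bootstrap interval for a mean and renders it in decision-oriented language alongside the point estimate.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic weekly revenue-lift observations from a hypothetical campaign ($k).
lift = rng.normal(20.0, 8.0, size=60)

# Bootstrap the sampling distribution of the mean lift.
boot_means = np.array([
    rng.choice(lift, size=lift.size, replace=True).mean()
    for _ in range(5_000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])

# A point estimate alone invites overconfidence ...
print(f"Estimated mean lift: ${lift.mean():.1f}k/week")
# ... whereas a framed interval supports a decision.
print(f"The campaign's mean lift is likely between ${lo:.1f}k and "
      f"${hi:.1f}k per week (95% bootstrap interval).")
```

The second message costs two extra lines of code yet gives a stakeholder the variability assessment both sets of authors above argue should accompany any estimate.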
Simplicity and concision are virtues in technical communication, but they should be used to elegantly explain difficult concepts and not to obscure or avoid them. When it comes to topics like uncertainty, models that may seem like black boxes, and inference about unobservable parameters, data scientists should strive to communicate more about modeling within organizations, not less. Likewise, Manski expressed the hope that "concealment of uncertainty is a modifiable social norm" addressable by increased awareness. It is incumbent upon the data scientist to identify the most effective ways to provide context for and to explain why their model acts the way it does, why they believe its implications, how they have made relevant methodological choices and assumptions, and what caveats remain in their implementation or analysis. Not every facet of a model needs to be belabored, but the ones that are most important to the data scientist will generally have significance for the audience as well. Communication should be viewed as an integral part of, and an outcome from, a balanced model development process.