Are You Certain About How Uncertain You Should Be?

TL;DR? Here's the BLUF:

It has been said that love is a many-faceted thing. So is uncertainty. The multiple sources of uncertainty in any data science or analytics effort should therefore be well documented and expressed as precisely as possible. Doing otherwise is poor practice, and it undermines sound decision-making based on the results.

Life is chock full of uncertainty. So are research, prediction, and data analytics.

I've recently been considering some challenges in expressing and communicating uncertainty. I've also been thinking about the critical elements of reproducible data science workflows that I need to emphasize for my students and colleagues. I find myself once again reminded of how difficult it can be to understand uncertainty and to take it into account as much as possible when making decisions.

Many have made the distinction between epistemic and aleatoric uncertainty, the former being due to lack of knowledge (i.e., ignorance), and the latter to inherent randomness or variation of some sort. Uncertainty can have various sources. It accompanies almost all prospects in the real world, and is unavoidably present in contexts in which decisions need to be made. A casual online search using the query "understanding uncertainty" produces many hits. Apparently uncertainty, and the understanding of it, continue to be widely popular topics. This is no surprise.
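The epistemic/aleatoric distinction can be made concrete with a small simulation. What follows is only a minimal sketch under assumptions of my own: the function name simulate, the true mean of 10, and the noise level of 2 are all hypothetical. The spread of the measurements stands in for aleatoric uncertainty, and the standard error of the estimated mean stands in for epistemic uncertainty.

```python
import random
import statistics

def simulate(n, true_mean=10.0, noise_sd=2.0, seed=0):
    """Draw n noisy measurements (hypothetical values) and summarize two kinds of uncertainty."""
    rng = random.Random(seed)
    sample = [rng.gauss(true_mean, noise_sd) for _ in range(n)]
    spread = statistics.stdev(sample)   # aleatoric: variation inherent in the measurements
    std_error = spread / n ** 0.5       # epistemic: our ignorance about the underlying mean
    return spread, std_error

for n in (10, 100, 10_000):
    spread, std_error = simulate(n)
    print(f"n={n:>6}  spread={spread:.2f}  std_error={std_error:.3f}")
```

In a run like this, the standard error shrinks as more data are collected, while the spread settles near the noise level and stays there. More knowledge reduces the first kind of uncertainty but not the second.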

Where analytics and scientific efforts are concerned, accounting for, or at the very least acknowledging, uncertainty is critical to obtaining the best possible results-based outcomes, and to gaining a better understanding of the world. Taking uncertainty into account can clarify risk for decision-makers, and it might even help with avoiding negative consequences. One of the challenges in accounting for, and expressing, uncertainty is that the quality, or precision, with which it can be done depends on the nature of the thing (e.g., fact, hypothesis, results-producing process) about which uncertainty is expressed.

Have you ever pondered the variety of decisions and judgments made during the course of a research or analytic project, and the uncertainties that come with them? Have you ever had to go back to determine if you could reproduce your own results? Have you ever had to verify whether someone else's results are reproducible? What might you be, or should you be, uncertain about?

Questions like these indicate the need for comprehensive, well documented, and reproducible analytics and data science workflows that address uncertainty. These are principled workflows. It seems to me that such workflows are important for all analytic efforts whose results or findings will be used for nontrivial purposes. This includes so-called "one off" projects.

If you are an analytics or data science provider, principled workflows should characterize your practice. As a consumer of, or a stakeholder in, the analytics efforts of others, you owe it to yourself to require complete visibility into provider workflows.

Veridical Data Science

Yu & Kumbier (2020) describe a framework for doing principled data science that they refer to as veridical data science. Their framework has three core principles: predictability, computability, and stability. Predictability is important because of its centrality in scientific efforts: models and theories should be able to predict new observations, serving as a kind of reality check. All data science models are some sort of abstraction of reality. Their connection to the real phenomena of interest should be verified.

As the authors define it, stability is a broad concept that subsumes statistical uncertainty, and that covers uncertainty and risk across the entire spectrum of a data science effort. It can be assessed using methods that include perturbing elements of an analytics project, from problem formulation to expression of results. All elements that involve decisions should be examined in this regard.
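A stability check of this kind might be sketched as follows. This is my own illustration, not the authors' procedure. The data are made up, the function names trimmed_mean and stability_check are hypothetical, and the two perturbations shown (bootstrap resampling of the data, and varying a judgment call about how aggressively to trim extreme values) are just two of many that could be examined.

```python
import random
import statistics

def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each tail (assumes trim_fraction < 0.5)."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    return statistics.fmean(ordered[k:len(ordered) - k])

def stability_check(values, trim_fractions=(0.0, 0.05, 0.10), n_resamples=200, seed=0):
    """Re-estimate under a data perturbation (bootstrap) and a decision perturbation (trimming)."""
    rng = random.Random(seed)
    results = {}
    for trim in trim_fractions:
        estimates = []
        for _ in range(n_resamples):
            resample = rng.choices(values, k=len(values))  # sample with replacement
            estimates.append(trimmed_mean(resample, trim))
        results[trim] = (min(estimates), max(estimates))
    return results

# Hypothetical data: mostly well-behaved values plus a few extreme ones.
data_rng = random.Random(1)
data = ([data_rng.gauss(50, 5) for _ in range(95)]
        + [data_rng.gauss(120, 10) for _ in range(5)])

for trim, (low, high) in stability_check(data).items():
    print(f"trim={trim:.2f}  estimate range across resamples: {low:.1f} to {high:.1f}")
```

If the range of estimates shifts or widens noticeably from one trimming choice to another, the conclusion depends on a judgment call, and that dependence is worth documenting alongside the result.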

Computability is an essential veridical data science consideration because computing resources constrain how data can be obtained, how data can be processed, the kinds of models that can be attempted, and the ways that results can be expressed and communicated.

Yu & Kumbier's (Ibid.) veridical data science framework combines principled workflows with documentation. A result of serious use of the framework should be analytics and data science that are reproducible and that have demonstrable utility.

Communicating Uncertainty

van der Bles et al. (2019) conducted an interdisciplinary review of methods for expressing and communicating uncertainty about scientific findings. They propose a framework for communicating epistemic uncertainty that distinguishes between elements that include the nature of the communicators of uncertainty, the nature of the thing that there is uncertainty about, the form of the expression of uncertainty, and the characteristics of the audience for the communication. They also distinguish between the objects about which uncertainty is expressed and communicated, e.g., facts, numbers, and hypotheses. Uncertainty about these objects can be expressed in absolute or relative quantitative terms, ranging from probabilities and prediction sets to verbal summaries. They consider uncertainty about such objects to be direct uncertainty.

Indirect uncertainty, according to the authors, arises from the knowledge underlying statements about facts, numbers, or hypotheses. I note that it could also come from the processes producing the knowledge or the facts, numbers, or hypotheses. Expressions of indirect uncertainty can take the form of ordered categories or levels of uncertainty, or even qualitative statements about caveats.

van der Bles et al. (Ibid.) propose a nine-point scale of precision for statements about direct uncertainty, i.e., uncertainty with respect to facts, numbers, or hypotheses (Fig. 2, page 9). In decreasing order of precision:

1. A full explicit probability distribution
2. A summary of a distribution
3. A rounded number, range, or an order-of-magnitude assessment
4. A predefined categorization of uncertainty
5. A qualifying verbal statement
6. A list of possibilities or scenarios
7. Informally mentioning the existence of uncertainty
8. No mention of uncertainty
9. Explicit denial of uncertainty
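To see how much information is lost moving down the scale, here is a minimal sketch that expresses the same uncertainty at the first three levels of the list above. Everything in it is an assumption for illustration: the draws are simulated stand-ins for, say, bootstrap or posterior samples of some quantity of interest, and the function name express_uncertainty is hypothetical.

```python
import random
import statistics

def express_uncertainty(draws):
    """Express one set of draws as a full distribution, a summary, and a rounded range."""
    cuts = statistics.quantiles(draws, n=40)  # 39 cut points in 2.5% steps
    low, high = cuts[0], cuts[-1]             # 2.5th and 97.5th percentiles
    return {
        # First entry on the scale: the full distribution (here, the draws themselves)
        "full_distribution": draws,
        # Second entry: a summary of the distribution
        "summary": {"mean": statistics.fmean(draws), "q2.5": low, "q97.5": high},
        # Third entry: a rounded range
        "rounded_range": (round(low, 1), round(high, 1)),
    }

# Hypothetical draws for some quantity of interest.
rng = random.Random(42)
draws = [rng.gauss(0.30, 0.04) for _ in range(5000)]

result = express_uncertainty(draws)
print(result["summary"])
print(result["rounded_range"])
```

Each step down discards detail. The summary no longer shows the shape of the distribution, and the rounded range no longer shows where within the interval the value most plausibly lies.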

Clearly, the further down this scale you go, the worse the accounting for epistemic uncertainty becomes. Towards the bottom, things go from laziness or incompetence to blatant dishonesty.

Merchants of Certainty¹

Data science and analytics providers who fail to adequately express and communicate uncertainty encourage overconfidence on the part of their stakeholders or clients. If you are a stakeholder or a client for quantitative research services, be sure to avoid such "Merchants of Certainty." Require from your providers transparent, well documented, reproducible workflows. You might even require that you be able to duplicate their work in detail. If you are a provider of services, try to avoid being a "Merchant of Certainty." You'll be doing yourself, and your stakeholders or clients, a favor.

References

van der Bles, A. M., van der Linden, S., Freeman, A. L. J., Mitchell, J., Galvao, A. B., Zaval, L. & Spiegelhalter, D. (2019) "Communicating uncertainty about facts, numbers and science," R. Soc. Open Sci. 6: 181870. https://royalsocietypublishing.org/doi/epdf/10.1098/rsos.181870

Yu, B. & Kumbier, K. (2020) "Veridical Data Science," PNAS 117(8): 3920-3929. https://www.pnas.org/doi/full/10.1073/pnas.1901326117

¹ This is an homage to the documentary book "Merchants of Doubt" (2010). You can find it on Amazon and elsewhere. There is a website for it at https://www.merchantsofdoubt.org/.

Loma Buena Peculiar Inklings