Are You Certain About How Uncertain You Should Be?
This Blog Post: Are You Certain About How Uncertain You Should Be? © 2024 by Lynd Bacon & Loma Buena Associates is licensed under CC BY-SA 4.0
TL;DR? Here's the BLUF:
It has been said that love is a many-faceted thing. So is uncertainty. Because it is, the multiple sources of uncertainty in any data science or analytics effort should be well documented and expressed as precisely as possible. Doing otherwise is poor practice, and it does not serve the best possible decision-making based on results.
Life is chock full of uncertainty. So are research, prediction, and data analytics.
I've recently been considering some challenges in expressing and communicating uncertainty. I've also been thinking about the critical elements of reproducible data science workflows that I need to emphasize for my students and colleagues. I find myself once again reminded of how difficult it can be to understand uncertainty and to take it into account as much as possible when making decisions.
Many have made the distinction between epistemic and aleatoric uncertainty, the former being due to lack of knowledge (i.e., ignorance), and the latter to inherent randomness or variation of some sort. Uncertainty can have various sources. It accompanies nearly all prospects in the real world, and it is unavoidably present in contexts in which decisions need to be made. A casual online search using the query "understanding uncertainty" produces many hits. Apparently uncertainty, and the understanding of it, continue to be popular topics. This is no surprise.
Where analytics and scientific efforts are concerned, accounting for, or at the very least acknowledging, uncertainty is critical to obtaining the best possible results-based outcomes and to gaining a better understanding of the world. Taking uncertainty into account can clarify risk for decision-makers, and it may even help with avoiding negative consequences. One challenge in accounting for and expressing uncertainty is that the quality, or precision, with which it can be done depends on the nature of the thing (e.g., a fact, a hypothesis, a results-producing process) about which uncertainty is expressed.
Have you ever pondered the variety of decisions and judgments made during the course of a research or analytic project, and the uncertainties that come with them? Have you ever had to go back to determine if you could reproduce your own results? Have you ever had to verify whether someone else's results are reproducible? What might you be, or should you be, uncertain about?
Questions like these indicate the need for comprehensive, well-documented, and reproducible analytics and data science workflows that address uncertainty. These are principled workflows. It seems to me that such workflows are important for all analytic efforts whose results or findings will be used for nontrivial purposes. This includes so-called "one-off" projects.
If you are an analytics or data science provider, principled workflows should characterize your practice. As a consumer of, or a stakeholder in, the analytics efforts of others, you owe it to yourself to require complete visibility into provider workflows.
Veridical Data Science
Principled workflows are central to what Yu and Kumbier call "veridical data science": a framework for producing reliable, reproducible, and transparent data science results, organized around the principles of predictability, computability, and stability (PCS).
Communicating Uncertainty
van der Bles et al. (2019) provide an interdisciplinary review of methods for expressing and communicating uncertainty about scientific findings. They propose a framework for communicating epistemic uncertainty that distinguishes among the communicators of uncertainty, the nature of the thing about which there is uncertainty, the form of the expression of uncertainty, and the characteristics of the audience for the communication. They also distinguish among the objects about which uncertainty is expressed and communicated, e.g., facts, numbers, and hypotheses. Uncertainty about these objects can be expressed in absolute or relative quantitative terms, ranging from probabilities and prediction sets to verbal summaries. They consider uncertainty about such objects to be direct uncertainty.
Indirect uncertainty, according to the authors, arises from the quality of the knowledge underlying statements about facts, numbers, or hypotheses. I note that it could also come from the processes producing that knowledge or those facts, numbers, or hypotheses. Expressions of indirect uncertainty can be classified into ordered categories or levels of uncertainty, or they can even take the form of qualitative statements of caveats.
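As a minimal sketch of how such ordered levels might be made explicit in a reporting workflow, the Python snippet below defines a hypothetical four-level evidence-quality scale and attaches a matching caveat to a reported finding. The level names, the caveat wording, and the `report` helper are all invented for illustration, loosely inspired by evidence-grading schemes rather than drawn from van der Bles et al.

```python
from enum import IntEnum

class EvidenceQuality(IntEnum):
    """Hypothetical ordered levels of indirect uncertainty about the
    knowledge (or process) underlying a reported finding."""
    VERY_LOW = 1
    LOW = 2
    MODERATE = 3
    HIGH = 4

# Illustrative caveat text for each level (invented wording)
CAVEATS = {
    EvidenceQuality.VERY_LOW: "based on sparse, unvalidated inputs",
    EvidenceQuality.LOW: "based on limited or indirect evidence",
    EvidenceQuality.MODERATE: "based on reasonably consistent evidence",
    EvidenceQuality.HIGH: "based on extensive, well-validated evidence",
}

def report(finding: str, quality: EvidenceQuality) -> str:
    """Attach an explicit indirect-uncertainty caveat to a finding."""
    label = quality.name.lower().replace("_", " ")
    return f"{finding} ({label}-quality evidence; {CAVEATS[quality]})"

print(report("Projected churn reduction is about 5%", EvidenceQuality.LOW))
# -> Projected churn reduction is about 5% (low-quality evidence;
#    based on limited or indirect evidence)
```

The point of forcing the caveat through an ordered type, rather than free text, is that the level of indirect uncertainty then travels with the finding and cannot be silently dropped from a report.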
van der Bles et al. (Ibid.) propose a nine-point scale of precision for statements about direct uncertainty, i.e., uncertainty with respect to facts, numbers, or hypotheses (Fig. 2, page 9). In decreasing order of precision (a short code sketch follows the list):
- A full explicit probability distribution
- A summary of a distribution
- A rounded number, range, or an order-of-magnitude assessment
- A predefined categorization of uncertainty
- A qualifying verbal statement
- A list of possibilities or scenarios
- Informally mentioning the existence of uncertainty
- No mention of uncertainty
- Explicit denial of uncertainty
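To make the top of this scale concrete, here is a minimal Python sketch, using invented data, that approximates a full distribution for a sample mean by bootstrap resampling and then collapses it to a summary and a rounded range, i.e., the three most precise levels above. The sample parameters, the number of resamples, and the choice of a 95% interval are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical sample of some outcome measure (illustrative data only)
sample = rng.normal(loc=100.0, scale=15.0, size=200)

# Level 1: a full explicit distribution, approximated here by
# bootstrap resampling the sample mean
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])

# Level 2: a summary of that distribution (point estimate + 95% interval)
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean = {boot_means.mean():.2f}, 95% interval = ({lo:.2f}, {hi:.2f})")

# Level 3: a rounded range, trading precision for ease of communication
print(f"roughly {round(lo)} to {round(hi)}")
```

Each step down the scale discards information: the full set of bootstrap means supports any later summary, while the rounded range supports none of the earlier forms.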
Merchants of Certainty¹
Data science and analytics providers who fail to adequately express and communicate uncertainty encourage overconfidence on the part of their stakeholders or clients. If you are a stakeholder or a client for quantitative research services, be sure to avoid such "Merchants of Certainty." Require transparent, well-documented, reproducible workflows from your providers. You might even require that you be able to duplicate their work in detail. If you are a provider of services, try to avoid being a "Merchant of Certainty." You'll be doing yourself, and your stakeholders or clients, a favor.
References
van der Bles, A.M., van der Linden, S., Freeman, A.L.J., Mitchell, J., Galvao, A.B., Zaval, L., and Spiegelhalter, D.J. (2019). "Communicating uncertainty about facts, numbers and science." R. Soc. Open Sci. 6: 181870. https://royalsocietypublishing.org/doi/epdf/10.1098/rsos.181870

¹ This is an homage to the documentary book Merchants of Doubt (2010). You can find it on Amazon and elsewhere. There is a website for it at https://www.merchantsofdoubt.org/.