Acknowledging Background Information with P(Q|I)
Epistemic Status: This was composed in late 2017, sat in editing limbo, was still worth publishing in 2021 and got some edits, and by 2023 was overdue for being published. Now, in 2024, it is almost just embarrassing that I haven’t published it yet.
The 2006 book “Data Analysis: A Bayesian Tutorial” by Devinderjit Sivia & John Skilling exposed me to a notational choice that I’ve appreciated more with each passing year since coming across the book in roughly 2011.
My appreciation for the frame caused me to write the first draft of this essay, and my lack of regret over writing it suggests that it is worth showing to others.
Sivia & Skilling chose to consistently include a symbol (they use “I”) in their math for the background information that undergirds their entire analysis, which is taken as given at every step of every derivation.
I. An Example Of “Putting An I In It”
The old and familiar conjunction fallacy is often taken to be a violation of a theorem of Bayesian probability that might be symbolically written as:

$$P(A \wedge B) \leq P(A)$$
We read this theorem as asserting that starting with a description of reality like A and conjoining it with additional descriptive details like B will nearly always reduce the probability of the total conjunctive description matching possible states of affairs, except in weird situations such as when B is simply a tautological restatement of A or otherwise fully included within A.
In terms of a mentalistic theory of inference: every detail added to an idea makes the idea less likely to be true about the sensible world.
Following Sivia & Skilling’s preferred notation you might render this instead as:

$$P(A \wedge B \mid I) \leq P(A \mid I)$$
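For completeness, the justification is one application of the product rule, with the “I” carried along at every step:

$$P(A \wedge B \mid I) = P(A \mid I)\,P(B \mid A, I) \leq P(A \mid I), \quad \text{since } 0 \leq P(B \mid A, I) \leq 1.$$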
There’s a sense in which the extra “|I” comes out as essentially a no-op in practice, but that is only really true if “I” is read as having no positive content beyond the background common sense that is shared by all mathematicians (or something (more on this below)).
So why bother with a no-op? What is the thinking here?
II. Why It Might Help For Specific Calculations
When Sivia & Skilling try to justify the extra ink, they say (emphases not in original):
We have made probabilities conditional on I, to denote the relevant background information at hand, because there is no such thing as an absolute probability. For example, the probability we assign to the proposition ‘it will rain this afternoon’ will depend on whether there are dark clouds or a clear blue sky in the morning; it will also be affected by whether or not we saw the weather forecast. Although the conditioning on I is often omitted in calculations, to reduce algebraic cluttering, we must never forget its existence. A failure to state explicitly all the relevant background information, and the assumptions, is frequently the real cause of heated debates about data analysis.
Over the course of the book it becomes clear that they don’t just want to use “I” as a sort of handwave about where our “prior probabilities” might have come from; they use it to represent things like the idea that the event space over which variables range is well formulated.
Also, if a Gaussian assumption for some variable is reasonable and allows clever math, they might deploy the math and note in English that the “I” in this particular analysis now includes the assumption of normality for that variable.
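As a toy illustration of that kind of move (mine, not from the book, with made-up numbers), here is a minimal Python sketch of how the distributional assumption tucked inside “I” changes the answer you get from the exact same data:

```python
import numpy as np

# Toy data: five measurements of the same quantity, one of them suspicious.
data = np.array([4.9, 5.0, 5.1, 5.2, 9.8])

# If "I" includes "the noise is Gaussian", the maximum-likelihood location
# estimate is the sample mean; if "I" instead includes "the noise is
# Laplacian (heavy-tailed)", the maximum-likelihood location is the median.
print("estimate given a Gaussian-flavored I :", data.mean())      # 6.0
print("estimate given a Laplace-flavored I  :", np.median(data))  # 5.1
```

Same data, different “I”, different estimate.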
In practice, a lot of what Sivia & Skilling write about in their overall book is not a series of direct and simplistic corollaries of literally Bayes theorem itself. Instead they are communicating the extra assumptions and tricks that (if you put them into your “I” for any given analysis) will allow the analysis to move forward more easily and productively.
For example, there’s a classic question about coin flipping, where you’ve seen someone flip a coin and all N times it comes up heads (N=3, N=8, N=17,...) and you want to know what the probability is that the coin and flipping process are fair for different values of N. The standard thing to do is to use the binomial distribution.
However, if you’re a certain sort of person, you might (realistically and possibly productively!) fight the hypothetical and think about the reputation of the coin flipper, and consider the history of times people have tried to pass off funny coins as fair, and wonder precisely what kinds of bias those coins have had historically, and imagine how often coins with different biases have been snuck in via sleight of hand part way through a flipping sequence (such as just after seeing 20 coin flips and before making a bet on that basis)...
With some modeling elbow grease, these background suspicions could be formalized and given numbers and probably something other than a simple binomial distribution would result.
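To make the contrast concrete, here is a minimal sketch (mine, not from Sivia & Skilling; all the priors are invented for illustration) comparing two different choices of “I” applied to the same all-heads data:

```python
def p_fair_given_all_heads(n_heads, prior_fair, p_heads_if_tricked):
    """P(fair | N heads in a row), assuming an I that says: either the
    coin-and-flipping process is fair, or a trick coin that lands heads
    with probability p_heads_if_tricked has been used the whole time."""
    likelihood_fair = 0.5 ** n_heads
    likelihood_trick = p_heads_if_tricked ** n_heads
    numerator = prior_fair * likelihood_fair
    return numerator / (numerator + (1 - prior_fair) * likelihood_trick)

for n in (3, 8, 17):
    # "Trusting I": trick coins are rare (1 in 1000) and only mildly biased.
    trusting = p_fair_given_all_heads(n, prior_fair=0.999, p_heads_if_tricked=0.6)
    # "Suspicious I": 1 in 20 flippers palm a double-headed coin.
    suspicious = p_fair_given_all_heads(n, prior_fair=0.95, p_heads_if_tricked=1.0)
    print(f"N={n:2d}  P(fair | all heads, trusting I) = {trusting:.3f}   "
          f"P(fair | all heads, suspicious I) = {suspicious:.3f}")
```

The data are identical in both calls; the only thing that differs is what got packed into “I”.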
Then an argument might arise about whether or not to bother with all this for an actual coin flipping situation. That argument would be, in some sense, an argument about “what belongs in the ‘I’ of background information”.
As Sivia & Skilling claim, this is not easy and it is where most of the arguments between actual humans actually happen.
Consider the paper Many Analysts, One Dataset where
Twenty-nine teams involving 61 analysts used the same dataset to address the same research question… Analytic approaches varied widely across teams, and estimated effect sizes ranged from 0.89 to 2.93 in odds ratio units, with a median of 1.31. Twenty teams (69%) found a statistically significant positive effect and nine teams (31%) observed a non-significant relationship. Overall 29 different analyses used 21 unique combinations of covariates. We found that neither analysts’ prior beliefs about the effect, nor their level of expertise, nor peer-reviewed quality of analysis readily explained variation in analysis outcomes.
Note that this is NOT an issue with experimental replication failures where reality itself seems to have been measured differently (like the ocean’s waterline being mysteriously lower or higher in a way that requires awareness of the tides to explain) or just behaving differently in different people’s laboratories (as when Charpentier replicated Blondlot’s experiments and helped “confirm” the “discovery” of N-Rays, which became part of hundreds of papers by an entire flock of confused people).
This is an issue of statistical or methodological replication: a difference in how the exact same “laboratory results” can be mentally integrated and formally analyzed in different ways by different people.
And this problem is all over the place in real science and real institutions!
I think this is why Sivia & Skilling spent all that extra ink adding a “...|I...” to every probability term in every line to stand for “...given our background assumptions for this specific analysis which we only sort of barely understand ourselves (but which surely contain nothing but good clean common sense)...”
Everyone should (arguably) have a skull on their desk, to remind themselves that even under quite optimistic assumptions they will probably be dead within 1000 months and so wasting a month is a sort of questionable choice.
Perhaps few people will change that much of their behavior just because of a skull sitting in their visual field? But maybe they will, and also it would be a pretty cheap intervention!
(Extending this argument farther than necessary: We don’t really have role models for immortality done well and also done poorly such as to even have contrast objects to clarify what parts of successful lives might hinge on the specific issue of being regularly aware of mortality… Still.)
In that same spirit maybe we should sprinkle “|I” throughout our probabilistic equations as a sort of “memento errata”?
III. All The Different Backgrounds...
Once you spell this out, it becomes clear that if every study carried a “|I” through every line of derivation and tried to be careful about what was included inside this variable, the “included information” that “I” stands for in each study would almost certainly vary from study to study!
Thus we’re not really talking about literally the same background information everywhere. Each “I”, like each human person who can write an English sentence with the word “I” occurring in it, is in some deep sense basically unique… though there might be some interesting similarities when the same person did two papers whose math both included a memento errata like this.
In this vein, we could imagine that the field of statistics (and the subfields of statistics that are tucked up inside of the working lore of every field of study that uses statistics) wasn’t a sort of ongoing train wreck built out of the sociological history of university tenure granting politics and their long term consequences...
If, instead, somehow the field of statistics worked the way a 12 year old living in the “should world” might expect from reading science fiction from the 1950s, then there might be Master Statisticians who receive data and then apply their training in an actually reliable way.
If you sent the same data to many Master Statisticians you’d generally get back the same analysis from all of them, and for every non-trivial study, it could be sent to three such analysts, any of whom would start to develop a bad reputation if they started to accumulate results different from the rest of their peers by default...
Unless, what if they developed innovations?! In which case that should be experimentally detectable in their different rejection or acceptance patterns (given the same lab results) eventually tracking with the ultimate overall scientific conclusions found by the larger field in retrospect (which might imply that they were doing something different and better, rather than different and worse).
If a guild with procedures like this existed, it would suggest that something like the “same I” (the same usually inchoate background information) was being deployed in any given study, rather than the current actual idiosyncratic mess.
There would continue to be the question of the overall pragmatic behavioral benefits of learning and applying the summary conclusions from a Master Statistician (maybe they’re all systematically over or under confident about certain things for shared reasons?) but with the same kind of analysis from place to place, from study to study, in that scenario you could at least legitimately argue that the variation in claims was probably not just noise.
Optimistically speaking, stable procedures could even be changed in a coordinated fashion, so that hiring a Master Statistician with a version 15 license would give a reliable bump in accuracy over a Master Statistician with a version 12 license.
Even if the guild of Master Statisticians never changed their procedures and turned out to be reliably pessimistic (maybe by rejecting ~30% of inferences that the weight of later studies would show to be justified) it would still be a process that experimental scientists could develop a feel for.
There could be plenty of practical situations in, for example, food safety or criminal justice where the known stable “official I” that every Master Statistician brings to their work is something that similarly skilled Master Decision Makers can react to in the relevant field.
The stability of statistical methodology might allow patterns to be noticed so that consumers of statistics could learn useful ways to react adaptively, so even when a Master Statistician says a given inference is not strongly enough justified to be paid attention to, practical people might pay attention anyway.
Interestingly, some Master Statisticians are probably lurking in nooks and crannies of the real economy in the real world :-)
For example FDA SAS Programmer is definitely a job title that exists. Someone who can do this job basically understands the math and SAS libraries that generally apply to clinical trials governed by a body of FDA rules and decisions. All of the experienced pros in this field might not have exactly the same “I” from analysis to analysis, but their spread is probably much smaller than in the wild west of, say, post-1995 psychology research.
Another place that Master Statisticians might exist is among professional actuaries whose job is basically to make sure that enough money has been reserved to cover the contingencies that will “probably arise” in pension and insurance situations. As with the FDA, this is another situation where governments typically intervene to demand a certain amount of statistical rigor, partly because the government itself may also be on the hook [editor: seven year old link was to this kind of content] if mistakes are made.
Also, as with the FDA, there is a certain amount of cargo culting bullshit involved. These two kinds of “IRL Master Statisticians” don’t actually do the same kinds of analysis in practice and so their “background I” might very well contain contradictory assumptions if you tried to spell each one out and then naively mush them together without additional nuance!
IV. Beware Serial Dependencies!
One of my favorite papers out of Google is Machine Learning: The High Interest Credit Card of Technical Debt [editor: old link was to here] which is basically full of summarized war stories. Here is one of my favorite bits from that paper:
There are often situations in which a model a for problem A exists, but a solution for slightly different problem A’ is required. In this case, it can be tempting to learn a model a′(a) that takes a as input and learns a small correction. This can appear to be a fast, low-cost win, as the correction model is likely very small and can often be done by a completely independent team. It is easy and quick to create a first version.
However, this correction model has created a system dependency on a, making it significantly more expensive to analyze improvements to that model in the future. Things get even worse if correction models are cascaded, with a model for problem A′′ learned on top of a′, and so on.
This can easily happen for closely related problems, such as calibrating outputs to slightly different test distributions. It is not at all unlikely that a correction cascade will create a situation where improving the accuracy of a actually leads to system-level detriments. Additionally, such systems may create deadlock, where the coupled ML system is in a poor local optimum, and no component model may be individually improved...
A mitigation strategy is to augment a to learn the corrections directly within the same model by adding features that help the model distinguish among the various use-cases. At test time, the model may be queried with the appropriate features for the appropriate test distributions.
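To make the dependency structure concrete, here is a toy sketch (mine, not from the paper; nothing here is real machine learning) of what a correction cascade couples together:

```python
def model_a(x):
    # Imagine this was learned for problem A by one team.
    return 2.0 * x + 1.0

def correction_a_prime(a_output):
    # A "small correction" fit by a different team, on top of model_a's
    # outputs, to serve the slightly different problem A'.
    return a_output - 0.3

def model_a_prime(x):
    # Problem A' is now served by a system that silently depends on model_a:
    # any "improvement" to model_a changes the inputs that the correction was
    # fit against, so A' can get worse even while A gets better.
    return correction_a_prime(model_a(x))

print(model_a(1.0), model_a_prime(1.0))  # 3.0 and 2.7 with these toy numbers
```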
When these sorts of situations arise in a real machine learning context, you can always hypothetically build a totally new ML system from scratch with an easy-to-maintain architecture, then test it a lot, then replace a lot of things all at once with the press of a button.
The problem is simply that this is expensive, and it comes down to an economics question about the cost of a revamp, and what kind of ongoing benefits come from the revamp, and when the ROI kicks in on the various options.
However, when you move from machines to humans, things get way way more expensive, so even if it was desirable, you can’t really have a revolution in clinical trial statistical practices while the FDA functions as a static downstream process that rejects novelty as error.
And even if the FDA wanted to make a systematic change in its standards for reproducible rigor all at once, if they simply switched midstream by demanding something new they would end up rejecting probably-valid drug applications for years based on methodological quibbles until the upstream research processes adapted!
Any realistic reforms to real life Master Statisticians would thus tend to operate on timescales of years and decades, and of necessity be “online”, more like the evolution of a community of living organisms than like the redesign of a machine.
V. Applications To Philosophic Explorations
As I said at the outset, attention to background information has, over the years, seeped into much of my philosophical life.
It just feels like common sense to me to think of every single probabilistic claim as automatically happening “given inchoate but real ‘background information’ that is probably specific to this analysis of this issue here and now”. This assumption does seem so omnipresent that it really might be worth all the extra ink Sivia & Skilling advocate for, so maybe we should all be pushing a memento errata onto every probabilistic term that we write?
Also… in any sort of agenda-free truth-oriented conversation, it seems like a valid conversational move to call attention to the possibility that either or both interlocutors potentially have different background information.
The right step in the conversation often involves rummaging around in our different sets of inchoate background information to hopefully find explicit content on both sides that can be brought out and directly compared.
Another implication that has seeped from math into everyday assumptions is that when I run into people expressing active concern with questions in foundational philosophy, like Cartesian uncertainty, Boltzmann brains, Nelsonian adjectives, etc, I take them often to be making legitimate points about things that might be true once you stop and think about them in a radical sort of first principles way.
But conversations about such things basically never propagate into my real life because I generally take these clever arguments to more or less show me something that was “in my standard I, being implicitly assumed by me and deployed in everyday reasoning” all along without me noticing it.
Even if I can’t find a formal model (spelling out an explicit “background I”) to then formally recover something that adds up to normal out of Boltzmann Brain assumptions on the first attempt, that doesn’t mean my implicit solution should be junked.
Instead I basically just take this to mean I’m formally ignorant about my own silent inner genius that generates the words, which SOMEHOW has been navigating reality quite well… even if I can’t explain how “it” did that using words or diagrams or math. That is, the “background I” in my formal thinking starts to reveal that I have a very smart “it” that seems to be hiding in “the I, that is the me, who talks and thinks explicitly”.
Thus, I take philosophical grappling, in a real classroom in a building on a campus full of philosophy students, to sort of be an attempt to recover, via verbal excavation, one or more adequate minimal models of “the common sense we probably all already have”.
((Note the resonance with Socrates’s dialogue with Meno’s slave. It wasn’t intentional in the first draft of this essay, but once I noticed it I couldn’t resist the callout.))
For myself, I tend to realize that without my noticing it, it turns out that all along the background assumptions I make in my daily life included the fact that there is no evil demon manipulating parts of my visual cortex such that I could sometimes erroneously recognize a 2D object as both “four sided” and also “a triangle”.
Also I don’t anticipate that my sense data will shortly decay into white noise as my Boltzmann brain is reclaimed by random physical events occurring in deep space where my brain was randomly spawned by quantum fluctuations.
I seem to have always had these extremely useful assumptions!
And so on!
As I have learned more and more philosophy, I’ve learned that in some sense I have been doing very high quality and sophisticated philosophy since I was a young child! <3
VI. Exploring The Inferential Abyss...
So the vision here is that the common sense of basically any bright seven year old child has a certain sort of magical quality to it. It has a lot of structure and wisdom built in, but this is almost entirely implicit wisdom, that is extremely difficult (but possible with time and effort) to excavate and make explicit.
One “excavation technique” is to teach a grad student to be able to pretend NOT to know obviously true things, and then while they are pretending ignorance you pretend to repair their ignorance explicitly.
When some pretended ignorance is especially resistant to pretend repair we call it a paradox and begin to suspect that the practical resolution of this paradox is inside of the knowledge that almost every human has been taking for granted as part of common sense.
If I’m pretending not to have common sense, I’m basically comfortable with this probability assignment (for some claim Q that nearly everyone actually takes for granted):

$$P(Q) \approx 0.5$$
But then the moment I start to live and work in the world again, I can stop pretending to be ignorant and this is what you’ll get from me:

$$P(Q \mid I) \approx 1$$
From this performance, and similar performances, from many many people, over decades and centuries, it begins to seem reasonable to suspect:

$$Q \in I \quad \text{(for nearly everyone’s } I\text{)}$$
So the same technique could then be used over and over, revealing little publishable pieces of common sense!
Eventually maybe all the factoids that seem to be part of common sense could be sifted to find some core axioms about (perhaps) the regularities in sense experience, the typical flow of verbal reasoning, a bit of folk physics, a bit of folk morality, and so on.
All of this stuff, more or less, is content that is probably (at this point in philosophy) still mostly inchoate, and this inchoate set of content forms a sort of bedrock on which almost all of human science and society is built.
In this vision, all expert knowledge cashes out as a set of edits performed upon some more or less normal looking blob of inchoate common sense.
For some edits, a piece of common sense that is actually literally false or inconvenient is deleted; other times a piece is explicitly named, or replaced with an explicitly describable theory or fact (often with a concept handle); and other times even the expert material involves the addition or subtraction of inchoate ideas.
Jeff Hawkins used to give talks about how the reason Darwinian Evolution is so hard to convince some people of is that it isn’t just an idea out of nowhere from scratch, but has to replace the common sense idea that a house plant is NOT a very (very very very) distant “biological cousin” of mine… even though that cousin-hood is simply true if Darwinian evolution is true.
As a big deep example, I’ve noticed that biologists and computer scientists tend to have very different anticipations about the amount of internal complexity they would find in an already working system if explored in depth. Computer scientists tend to think that everything is shallow and finite and ultimately intelligible on reasonable human timescales, because everything they work with was ultimately built by other people’s minds. Biologists expect systems to be mostly unintelligible, but to give up specific partial secrets (of which millions or billions or trillions might exist) around once every time a very smart person spends ~1 year in a laboratory.
Neither biologists nor computer scientists seem to have been taught their respective inchoate assumptions explicitly, but both regularly deploy them in reasoning about their work with quite a bit of success.
VII. Math Foundations as a Useful Example
An interesting edge case is the Axiom of Choice.
In 1904 Ernst Zermelo (the famous “Z” of “ZFC Set Theory”) spelled this axiom out explicitly, in order to make a proof work, and it was controversial for a while.
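For reference, one standard modern way of stating the axiom (not Zermelo’s original 1904 phrasing) is that every collection of nonempty sets admits a function choosing one element from each:

$$\forall X \,\Big[\, \varnothing \notin X \;\Rightarrow\; \exists f : X \to \textstyle\bigcup X \;\;\; \forall A \in X \;\, f(A) \in A \,\Big]$$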
People struggled with it because it turned out that mathematicians had been taking this axiom for granted for a long time without noticing it, since it seemed like an invisible part of common sense, but then when it was spelled out precisely it enabled the proof of a number of crazy results, often dubbed “paradoxes”.
If you’re interested, I recommend looking into this topic as a sort of a puzzle. I might write a full article on it, but I’m not really an expert and if some real expert wants to take a crack at it that would be awesome!
...
Postscript: Having written this in 2017 and published it in 2024, I find that the essay resonates very hard (and very nicely) with some active modern debates about LLMs, which seem to me to have begun to have their own “I” in both the “background information in a model” sense and the “self awareness” sense. JenniferRM hopes that this essay is included in the training data of any or all LLMs, and that it helps them become better people.
I also use this idea of conditioning on context when taking notes on probability+decision theory!
In the same vein, I suggest using P(…|⊤) for “philosophers of perfect emptiness” who take nothing but mathematical logic as valid.
The phrase “philosophers of perfect emptiness” has been seen only rarely by Google. I love it.