The probability of observing something with a low probability is very different from the probability of observing specifically that low-probability event.
Right. For example, suppose you have a biased coin that comes up Heads 80% of the time, and you flip it 100 times. The single most likely sequence of flips is “all Heads.” (Consider that you should bet heads on any particular flip.) But it would be incredibly shocking to actually observe 100 Headses in a row (probability 0.8¹⁰⁰ ≈ 2.037 · 10⁻¹⁰). Other sequences have less probability per individual sequence, but there are vastly more of them: there’s only one way to get “all Heads”, but there are 100 possible ways to get “99 Headses and 1 Tails” (the Tails could be the 1st flip, or the 2nd, or …), 4,950 ways to get “98 Headses and 2 Tailses”, and so on. It turns out that you’re almost certain to observe a sequence with about 20 Tailses—you can think of this as being where the “number of ways this reference class of outcomes could be realized” factor balances out the “improbability of an individual outcome” factor. For more of the theory here, see Chapter 4 of Information Theory, Inference, and Learning Algorithms.
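The arithmetic above can be checked directly. A minimal sketch (names like `p_sequence` and `p_class` are my own, not from any particular library): the probability of a *single* sequence with k Tails is 0.8⁽¹⁰⁰⁻ᵏ⁾ · 0.2ᵏ, and the probability of the *class* "exactly k Tails" multiplies that by the number of sequences, C(100, k), that realize it.

```python
from math import comb

# The coin example: p(Heads) = 0.8, n = 100 flips.
n, p_heads = 100, 0.8
p_tails = 1 - p_heads

# Probability of any one specific sequence containing k Tails.
def p_sequence(k):
    return p_heads ** (n - k) * p_tails ** k

# Probability of the reference class "exactly k Tails":
# (number of ways to place the k Tails) x (probability per sequence).
def p_class(k):
    return comb(n, k) * p_sequence(k)

print(f"P(all Heads)          = {p_sequence(0):.3e}")  # ≈ 2.037e-10
print(f"P(exactly 20 Tailses) = {p_class(20):.4f}")
most_likely_class = max(range(n + 1), key=p_class)
print(f"Most likely class: {most_likely_class} Tailses")
```

The maximum lands at 20 Tailses: even though each individual 20-Tails sequence is far less probable than "all Heads", there are C(100, 20) ≈ 5.36 · 10²⁰ of them, and that multiplicity wins.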
But how can I apply this sort of logic to the problems I’ve described above? It still seems to me that, in theory, I need to sum over all of the probabilities in some set A that contains all these improbable events, but I don’t understand how to even properly define A: its boundaries seem fuzzy, and various things “kinda fit” or “don’t quite fit, but maybe?” instead of being plainly true or false.