> Substantial? No—it adds up to normality. Interesting? Yes.
I don’t understand what you mean by this.
Imagine two situations of equal improbability:
In one, Alice flips a coin N times in front of a crowd, and achieves some specific sequence M.
In the other, Alice flips a coin N / 2 times in front of a crowd, and achieves some specific sequence Q; she then opens an envelope, and reveals a prediction of exactly the sequence that she just flipped.
These two end results are equally improbable (both end results encode N bits of information—to see this, imagine that the envelope contained a different sequence than she flipped), but we attach significance to one result (appropriately) and not the other. What’s the difference between the two situations?
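A minimal sketch of the arithmetic behind that parenthetical, assuming the fair-coin null hypothesis and treating the envelope's contents as uniformly distributed over the 2^(N/2) possible sequences (N here is an arbitrary placeholder, not anything fixed by the example):

```python
from fractions import Fraction

# Both end results have the same probability under the null hypothesis
# (fair coin, envelope written with no knowledge of the flips).
N = 10  # arbitrary; any even N gives the same comparison

# Situation 1: one specific sequence M of N fair flips.
p_situation_1 = Fraction(1, 2) ** N

# Situation 2: one specific sequence Q of N/2 flips, AND an independently
# written envelope that happens to contain exactly Q.
p_flips = Fraction(1, 2) ** (N // 2)
p_envelope_matches = Fraction(1, 2) ** (N // 2)  # uniform-envelope assumption
p_situation_2 = p_flips * p_envelope_matches

print(p_situation_1)  # 1/1024
print(p_situation_2)  # 1/1024 -- the same probability, i.e. the same N bits
```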
To capture this problem entirely, it is important to make it explicit that the person observing the coin flips has not only a distribution over sequences of coin flips, but also a distribution over world-models that produce the sequences. It is often implicit, and sometimes explicitly assumed, in coin-flipping examples that a normal human flipping a fair coin is something like our null hypothesis about the world. Most coins seem fair in our everyday experience. Alice correctly predicting the sequence that she achieves is evidence that causes a substantial update on our distribution over world-models, even if the two sequences are assigned equal probability in our distribution over sequences given that the null hypothesis is true.
You can also imagine it as the problem of finding an efficient encoding for sequences of coin flips. If you know that certain subsequences are more likely than others, then you should find a way to encode more probable subsequences with fewer bits. Actually doing this is equivalent to forming beliefs about the world. (Like ‘The coin is biased in this particular way’, or ‘Alice is clairvoyant.’)
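As a rough illustration of that equivalence, here is a sketch using the standard identification of an ideal code length with minus the log-probability a model assigns to the data; the helper name, the particular sequence, and the 0.9 bias are all arbitrary choices for the sketch:

```python
import math

def ideal_code_length_bits(sequence, p_heads):
    """Ideal (Shannon) code length of a flip sequence, in bits, under a model
    that says each flip comes up heads with probability p_heads."""
    log_prob = sum(math.log2(p_heads if flip == 'H' else 1.0 - p_heads)
                   for flip in sequence)
    return -log_prob

# A sequence that is mostly heads, as you might see from a biased coin.
seq = 'H' * 45 + 'T' * 5

print(ideal_code_length_bits(seq, 0.5))  # fair-coin belief: 50.0 bits
print(ideal_code_length_bits(seq, 0.9))  # 'biased toward heads' belief: ~23.4 bits
```

The model that matches how the flips are actually produced is exactly the one that lets you encode typical sequences with fewer bits, which is the sense in which finding the efficient encoding and forming the belief are the same move.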
> Alice correctly predicting the sequence that she achieves is evidence that causes a substantial update on our distribution over world-models, even if the two sequences are assigned equal probability in our distribution over sequences given that the null hypothesis is true.
Except that we’re not updating all distributions of all possible world-models, or every single sequence would be equally surprising. You’re implicitly looking for evidence that, say, Alice is clairvoyant—you’ve elevated that hypothesis to your awareness before you ever looked at the evidence.
> Except that we’re not updating all distributions of all possible world-models, or every single sequence would be equally surprising.
If you don’t even know what you mean by surprise (because that’s what we’re ostensibly trying to figure out, right?), then how can you use the math to deduce that some quantitative measure of surprise is equal in all cases?
I still think this is just a confusion over having a distribution over sequences of coin flips as opposed to a distribution over world-models.
Suppose you have a prior distribution over a space of hypotheses or world-models M, and denote a member of this space as M’. Given data D, you can update using Bayes’ Theorem and obtain a posterior distribution over the M’. We can quantify the difference between the prior and the posterior using the Kullback-Leibler divergence and use it as a measure of Bayesian surprise. To see how one thing with the same information content as another can be more or less surprising, imagine an agent using this framework set in front of a television screen broadcasting white noise. The information content of each frame is very high, because there are so many equally likely patterns of noise, but the agent will quickly stop being surprised: it will settle on a world-model that predicts random noise, and the difference between its priors and posteriors over world-models will become very small.
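A minimal sketch of that measure applied back to the coin example, with a toy two-hypothesis space and an arbitrary prior on the exotic hypothesis; the function names, the 'null'/'alice_predicts' labels, and every number here are illustrative placeholders rather than claims about what the real prior should be:

```python
import math

def bayes_posterior(prior, likelihood):
    """Posterior over a discrete hypothesis space via Bayes' Theorem."""
    joint = {m: prior[m] * likelihood[m] for m in prior}
    total = sum(joint.values())
    return {m: p / total for m, p in joint.items()}

def kl_bits(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits."""
    return sum(p[m] * math.log2(p[m] / q[m]) for m in p if p[m] > 0)

N = 20  # situation 1 has N flips; situation 2 has N/2 flips plus the envelope

# Toy world-models: the null (fair coin, envelope written blind) and an exotic
# model on which Alice's prediction is guaranteed to match her flips.
prior = {'null': 1 - 1e-4, 'alice_predicts': 1e-4}  # arbitrary illustrative prior

# Situation 1: a specific sequence of N flips, no prediction involved.
# Both models assign it 2^-N, so the data carry N bits but force no update.
lik_no_envelope = {'null': 2.0 ** -N, 'alice_predicts': 2.0 ** -N}

# Situation 2: N/2 flips plus an envelope that matches them exactly. Under the
# null the match costs a further 2^-(N/2); under the exotic model it is certain.
lik_matching_envelope = {'null': 2.0 ** -N, 'alice_predicts': 2.0 ** -(N // 2)}

for label, lik in [('no envelope', lik_no_envelope),
                   ('matching envelope', lik_matching_envelope)]:
    post = bayes_posterior(prior, lik)
    print(label, 'Bayesian surprise:', round(kl_bits(post, prior), 3), 'bits')
# Prints 0.0 bits for the first case and a clearly positive value for the
# second, even though both observations have probability 2^-N under the null.
```

The point of the toy model is just that the two observations have identical information content while producing very different amounts of movement in the distribution over world-models, which is the same distinction the white-noise example is making.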
If in the future we want to keep using a coin flip example, I suggest forgetting things that are so mind-like as ‘Alice is clairvoyant’, and maybe just talk about biased and unbiased coins. It seems like an unnecessary complication.
> If you don’t even know what you mean by surprise (because that’s what we’re ostensibly trying to figure out, right?), then how can you use the math to deduce that some quantitative measure of surprise is equal in all cases?
Because the number of bits of information is the same in all cases. Any given random sequence provides evidence of countless extremely low probability world models—we just don’t consider the vast majority of those world-models because they aren’t elevated to our attention.
> If in the future we want to keep using a coin flip example, I suggest forgetting things that are so mind-like as ‘Alice is clairvoyant’, and maybe just talk about biased and unbiased coins. It seems like an unnecessary complication.
It’s both necessary and relevant. Indeed, I crafted my example to make your brain come up with that answer. Your conscious mind, once aware of it, probably immediately threw it into the “Silly explanation” column, and I’d hazard a guess that if asked, you’d say you wrote it down as a joke.
Because it clearly isn’t an example of a world-model being allocated evidence. Your explanation is post-hoc—that is, you’re rationalizing. Your description would be an elegant mathematical explanation—I just don’t think it’s correct, as pertains to what your mind is actually doing, and why you find some situations more surprising than others.
I don’t know why you’re using self-information/surprisal interchangeably with surprise. It’s confusing.
> Any given random sequence provides evidence of countless extremely low probability world models—we just don’t consider the vast majority of those world-models because they aren’t elevated to our attention.
Like in the sense that there are hypotheses that something omniscient would consider more likely conditional on Alice doing something surprising, that humans just don’t think of because they’re humans? I don’t expect problems coming up with a satisfactory description of ‘the space of all world-models’ to be something we have to fix before we can say anything important about surprise.
> Because it clearly isn’t an example of a world-model being allocated evidence. Your explanation is post-hoc—that is, you’re rationalizing. Your description would be an elegant mathematical explanation—I just don’t think it’s correct, as pertains to what your mind is actually doing, and why you find some situations more surprising than others.
Maybe there’s more to be said about the entire class of things that humans have ever labeled as surprising, but this does capture something of what humans mean by surprise, and we can say with particular certainty that it captures what happens in a human mind when a visual stimulus is describable as ‘surprising.’ The framework I described has, to my knowledge, been shown to correspond quite closely to our neuroscientific understanding of visual surprise and has been applied in machine learning algorithms that diagnose patients based on diagnostic images. There are algorithms that register seeing a tumor on a CT scan as ‘surprising’ in a way that is quite likely to be very similar to the way that a human would see that tumor and feel surprised. (I don’t mean that it’s similar in a phenomenological sense. I’m not suggesting that these algorithms have subjective experiences.) I expect this notion of surprise to be generalizable.
> I expect this notion of surprise to be generalizable.
Which is what I’m trying to get at. There’s -something- there, more than “amount of updates to world-models”. I’d guess what we call surprise has a complex relationship with the amount of updates applied to world-models, such that a large update to a single world-model is more surprising than an equal “amount” of update applied across one thousand.