Fixed. Thanks!
How to Disentangle the Past and the Future
How does deciding one model is true give you more information?
Let’s assume a strong version of Bayesianism, which entails the maximum entropy principle. So our belief is the one with maximum entropy among those consistent with our prior information. If we now add the information that some model is true, this generally invalidates our previous belief, making the new maximum-entropy belief one of lower entropy. The reduction in entropy is the amount of information you gain by learning the model. In a way, this is a cost we pay for “narrowing” our belief.
The upside of it is that it tells us something useful about the future. Of course, not all information regarding the world is relevant for future observations. The part that doesn’t help control our anticipation is failing to pay rent, and should be evacuated. The part that does inform us about the future may be useful enough to be worth the cost we pay in taking in new information.
I’ll expand on all of this in my sequence on reinforcement learning.
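To make the entropy-reduction point concrete, here is a toy numerical sketch (the world, the model and the numbers are all hypothetical, chosen only for illustration): accepting a model that rules out half of a four-state world costs exactly one bit.

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# With no prior information about a 4-state world, the maximum-entropy
# belief is uniform: 2 bits of entropy.
prior = np.full(4, 0.25)

# Now we accept a (hypothetical) model that rules out states 2 and 3.
# The maximum-entropy belief consistent with it is uniform on the rest.
posterior = np.array([0.5, 0.5, 0.0, 0.0])

# The entropy drops from 2 bits to 1 bit; the 1-bit reduction is the
# information gained (and the "narrowing" cost paid) by learning the model.
print(entropy_bits(prior), entropy_bits(posterior))
```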
You’re not really wrong. The thing is that “Occam’s razor” is a conceptual principle, not one mathematically defined law. A certain (subjectively very appealing) formulation of it does follow from Bayesianism.
P(AB model) ∝ P(AB are correct) and P(A model) ∝ P(A is correct). Then P(AB model) ≤ P(A model).
Your math is a bit off, but I understand what you mean. If we have two sets of models, with no prior information to discriminate between their members, then the prior gives less probability to each model in the larger set than in the smaller one.
More generally, if deciding that model 1 is true gives you more information than deciding that model 2 is true, that means that the maximum entropy given model 1 is lower than that given model 2, which in turn means (under the maximum entropy principle) that model 1 was a-priori less likely.
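A small numerical sketch of this relation (the hypothesis counts here are hypothetical): under a uniform prior, the model that gives more information when accepted is exactly the one with the lower prior probability, and the entropy reduction equals -log2 of that prior.

```python
import numpy as np

N = 16                      # hypotheses, uniform prior, nothing to discriminate them
prior_entropy = np.log2(N)  # 4 bits

# "Model 1" narrows the truth down to 2 hypotheses, "model 2" to 8.
sizes = {"model 1": 2, "model 2": 8}

for name, k in sizes.items():
    prior_prob = k / N              # prior probability that the model is true
    posterior_entropy = np.log2(k)  # max-entropy belief given the model: uniform on k
    info_gain = prior_entropy - posterior_entropy
    # The information gained equals -log2(prior probability of the model):
    assert np.isclose(info_gain, -np.log2(prior_prob))
    print(name, prior_prob, info_gain)
```

So model 1 (prior 1/8) yields 3 bits, while model 2 (prior 1/2) yields 1 bit: more information, lower prior.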
Anyway, this is all beside the discussion that inspired my previous comment. My point was that even without Popper and Jaynes to enlighten us, science was making progress using other methods of rationality, among which is a myriad of non-Bayesian interpretations of Occam’s razor.
The ease with which images, events and concepts come to mind is correlated with how frequently they have been observed, which in turn is correlated with how likely they are to happen again.
Yes, and I was trying to make this description one level more concrete.
Things never happen the exact same way twice. The way that past observations are correlated with what may happen again is complicated—in a way, that’s exactly what “concepts” capture.
So we don’t just recall something that happened and predict that it will happen again. Rather, we compose a prediction based on an integration of bits and patches from past experiences. Recalling these bits and patches as relevant for the context of the prediction—and of each other—is a complicated task, and I propose that an “internal availability” mechanism is needed to perform it.
Take for example your analysis of the poker hand I partially described. You give 3 possibilities for what the truth of it may be. Are there any other possibilities? Maybe the player is bluffing to gain the reputation of a bluffer? Maybe she mistook a 4 for an ace (it happened to me once...)? Maybe aliens hijacked her brain?
It would be impossible to enumerate or notice all the possibilities, but fortunately we don’t have to. We make only the most likely and important ones available.
I was trying to give a specific reason why the availability heuristic is there (it’s coupled with another mechanism that actually generates the availability), and then to say a few things about that other mechanism.
Does anyone have specific advice on how I could convey this better?
Point-Based Value Iteration
Internal Availability
Imagine a bowl of jellybeans. [...]
Allow me to suggest a simpler thought experiment that hopefully captures the essence of yours, and shows why your interpretation (of the correct math) is incorrect.
There are 100 recording studios, each recording each day with probability 0.5. Everybody knows that.
There’s a red light outside each studio to signal that a session is taking place that day, except for one rogue studio, where the signal is reversed, being off when there’s a session and on when there isn’t. Only persons B and C know that.
A, B and C are standing at the door of a studio, but only C knows that it’s the rogue one. How do their beliefs that there’s a session inside change upon observing that the red light is on? A keeps the 50-50. B now thinks it’s 99-1. Only C knows that there’s no session.
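A quick sketch to reproduce these numbers (the helper function is my own, not part of the thought experiment):

```python
def posterior_session(p_rogue, knows_lights=True):
    """Posterior that a session is taking place, given "red light on".

    p_rogue: the observer's probability that THIS studio is the rogue one.
    knows_lights: whether the observer knows what the lights signify at all.
    """
    p_session = 0.5
    if not knows_lights:
        # For this observer the light carries no information about the session.
        return p_session
    # The light is on iff (normal studio and session) or (rogue studio and no session).
    p_on_given_session = 1 - p_rogue
    p_on_given_no_session = p_rogue
    p_on = (p_on_given_session * p_session
            + p_on_given_no_session * (1 - p_session))
    return p_on_given_session * p_session / p_on

print(posterior_session(0.01, knows_lights=False))  # A: stays at 0.5
print(posterior_session(0.01))                       # B: 0.99
print(posterior_session(1.0))                        # C: 0.0
```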
So your interpretation, as I understand it, would be to say that A and B updated in the “wrong direction”. But wait! I practically gave you the same prior information that C has—of course you agree with her! Let’s rewrite the last paragraph:
A, B and C are standing at the door of a studio. For some obscure reason, C secretly believes that it’s the rogue one. Wouldn’t you now agree with B?
And now I can do the same for A, by not revealing to you, the reader, the significance of the red lights. My point is that as long as someone runs a Bayesian update, you can’t call that the “wrong direction”. Maybe they now believe in things that you judge less likely, based on the information that you have, but that doesn’t make you right and them wrong. Reality makes them right or wrong; unfortunately, there’s no one around who knows reality in any other way than through their subjective information-revealing observations.
To anyone thinking this is not random, with 42 votes in:
The p-value is 0.895 (this is the probability of seeing at least this much non-randomness, assuming a uniform distribution)
The entropy is 2.302 bits instead of log₂(5) ≈ 2.322 bits, for 0.02 bits of KL divergence (this is the number of bits you lose by encoding one of these votes as if it were uniformly random)
If you think you see a pattern here, you should either see a doctor or a statistician.
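For anyone who wants to redo the arithmetic, here is one way to compute such numbers (the vote counts below are hypothetical, since the actual tallies aren’t quoted here, and a chi-square goodness-of-fit test stands in for whichever exact test was actually used):

```python
import numpy as np
from scipy.stats import chisquare

counts = np.array([9, 8, 10, 7, 8])  # hypothetical: 42 votes over 5 options
k = len(counts)
freqs = counts / counts.sum()

# Empirical entropy vs. the uniform maximum log2(5) ≈ 2.322 bits; the gap is
# the KL divergence from the empirical distribution to the uniform one
# (the bits lost by encoding a vote as if it were uniformly random).
nonzero = freqs[freqs > 0]
entropy = -np.sum(nonzero * np.log2(nonzero))
kl = np.log2(k) - entropy

# p-value against the null hypothesis of uniform voting: the probability of
# seeing at least this much deviation from uniformity by chance.
_, p_value = chisquare(counts)

print(entropy, kl, p_value)
```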
It is perfectly legal under the bayes to learn nothing from your observations.
Right, in degenerate cases, when there’s nothing to be learned, the two extremes of learning nothing and everything coincide.
Or learn in the wrong direction, or sideways, or whatever.
To the extent that I understand your navigational metaphor, I disagree with this statement. Would you kindly explain?
There is no unique “Bayesian belief”.
If you mean to say that there’s no unique justifiable prior, I agree. The prior in our setting is basically what you assume you know about the dynamics of the system—see my reply to RichardKennaway.
However, given that prior and the agent’s observations, there is a unique Bayesian belief, the one I defined above. That’s pretty much the whole point of Bayesianism, the existence of a subjectively objective probability.
If you had the “right” prior, you would find that you would have to do very little updating, because the right prior is already right.
This is true in a constant world, or with regard to parts of the world which are constant. And mind you, it’s true only with high probability: there’s always the slight chance that the sky is not, after all, blue.
But in a changing world, where part of the change is revealed to you through new observations, you have to keep pace. The right prior was right yesterday; today there’s new stuff to know.
Everything you say is essentially true.
As the designer of the agent, will you be explicitly providing it with that information in some future instalment?
Technically, we don’t need to provide the agent with p and sigma explicitly. We use these parameters when we build the agent’s memory update scheme, but the agent is not necessarily “aware” of the values of the parameters from inside the algorithm.
Let’s take for example an autonomous rover on Mars. The gravity on Mars is known at the time of design, so the rover’s software, and even hardware, is built to operate under these dynamics. The wind velocity at the time and place of landing, on the other hand, is unknown. The rover may need to take measurements to determine this parameter, and encode it in its memory, before it can take it into account in choosing further actions.
But if we are thoroughly Bayesian, then something is known about the wind prior to experience. Is it likely to change every 5 minutes or can the rover wait longer before measuring again? What should be the operational range of the instruments? And so on. In this case we would include this prior in p, while the actual wind velocity is instead hidden in the world state (only to be observed occasionally and partially).
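A minimal sketch of this division of labor (the toy dynamics, numbers and names are mine, purely illustrative): the known gravity is baked into the transition function, the designer’s prior over the wind shapes the initial-state distribution, and the actual wind value sits in the hidden world state, never changing and only observed noisily.

```python
import numpy as np

rng = np.random.default_rng(0)
MARS_GRAVITY = 3.72  # m/s^2, known at design time and baked into the dynamics

def sample_initial_state():
    # The wind is unknown a priori; the designer's prior over it (a Gaussian
    # here, purely for illustration) is part of the model, while the sampled
    # value becomes part of the hidden world state.
    wind = rng.normal(0.0, 5.0)
    position, velocity = 0.0, 0.0
    return np.array([position, velocity, wind])

def p(state, action, dt=1.0):
    """Transition dynamics: known physics plus the unchanging hidden wind."""
    position, velocity, wind = state
    acceleration = action + wind - MARS_GRAVITY  # toy vertical dynamics
    velocity += acceleration * dt
    position += velocity * dt
    return np.array([position, velocity, wind])  # the wind itself never changes

def sigma(state):
    """Observation: position is seen directly, the wind only through noise."""
    position, _, wind = state
    return np.array([position, wind + rng.normal(0.0, 1.0)])
```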
Ultimately, we could include all of physics in our belief—there’s always some Einstein to tell us that Newtonian physics is wrong. The problem is that a large belief space makes learning harder. This is why most humans struggle with intuitive understanding of relativity or quantum mechanics—our brains are not made to represent this part of the belief space.
This is also why reinforcement learning gives special treatment to the case where there are unknown but unchanging parameters of the world dynamics: the “unknown” part makes the belief space large enough to make special algorithms necessary, while the “unchanging” part makes these algorithms possible.
For LaTeX instructions, click “Show help” and then “More Help” (or go here).
The Bayesian Agent
If you’re a devoted Bayesian, you probably know how to update on evidence, and even how to do so repeatedly on a sequence of observations. What you may not know is how to update in a changing world. Here’s how:
$$\Pr(W_{t+1}|O_1,\ldots,O_{t+1}) = \frac{\sigma(O_{t+1}|W_{t+1})\cdot\Pr(W_{t+1}|O_1,\ldots,O_t)}{\sum_w \sigma(O_{t+1}|w)\cdot\Pr(w|O_1,\ldots,O_t)}$$
As usual with Bayes’ theorem, we only need to calculate the numerator for different values of $W_{t+1}$, and the denominator will normalize them to sum to 1, as probabilities do. We know $\sigma$ as part of the dynamics of the system, so we only need $\Pr(W_{t+1}|O_1,\ldots,O_t)$. This can be calculated by introducing the other variables in the process:
$$\Pr(W_{t+1}|O_1,\ldots,O_t) = \sum_{W_t,A_t}\Pr(W_t,A_t,W_{t+1}|O_1,\ldots,O_t)$$
An important thing to notice is that, given the observable history, the world state and the action are independent—the agent can’t act on unseen information. We continue:
$$\Pr(W_{t+1}|O_1,\ldots,O_t) = \sum_{W_t,A_t}\Pr(W_t|O_1,\ldots,O_t)\cdot\Pr(A_t|O_1,\ldots,O_t)\cdot p(W_{t+1}|W_t,A_t)$$
Recall that the agent’s belief $B_t$ is a function of the observable history, and that the action only depends on the observable history through the agent’s memory $B_t$. We conclude:
$$\Pr(W_{t+1}|O_1,\ldots,O_t) = \sum_{W_t,A_t}B_t(W_t)\cdot\pi(A_t|B_t)\cdot p(W_{t+1}|W_t,A_t)$$
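Here is a minimal array-based sketch of the resulting update (the distributions p, sigma and pi below are hypothetical placeholders): first the prediction step from the last equation, then the Bayes correction from the first one.

```python
import numpy as np

def update_belief(belief, observation, p, sigma, pi):
    """One step of the belief update derived above.

    belief[w]   = Pr(W_t = w | O_1..O_t)
    p[w, a, v]  = Pr(W_{t+1} = v | W_t = w, A_t = a)
    sigma[v, o] = Pr(O_{t+1} = o | W_{t+1} = v)
    pi[a]       = the policy's action distribution for the current belief
    """
    # Prediction: Pr(W_{t+1} | O_1..O_t) = sum_{w,a} belief[w] * pi[a] * p[w,a,:]
    predicted = np.einsum('w,a,wav->v', belief, pi, p)
    # Correction: weight by the likelihood of the new observation and normalize.
    unnormalized = sigma[:, observation] * predicted
    return unnormalized / unnormalized.sum()

# A tiny hypothetical world: 2 states, 2 actions, 2 observations.
p = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
sigma = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
pi = np.array([0.6, 0.4])
belief = np.array([0.5, 0.5])

print(update_belief(belief, observation=1, p=p, sigma=sigma, pi=pi))
```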
p(H|E1,E2) [...] is simply not something you can calculate in probability theory from the information given [i.e. p(H|E1) and p(H|E2)].
Jaynes would disapprove.
You continue to give more information, namely that p(H|E1,E2) = p(H|E1). Thanks, that reduces our uncertainty about p(H|E1,E2).
But we are hardly helpless without it. Whatever happened to the Maximum Entropy Principle? Incidentally, the maximum entropy distribution (given the initial information) does have E1 and E2 independent. If your intuition said this before you had more information, it was good intuition.
Don’t say that an answer can’t be reached without further information. Say: here’s more information to make your answer better.
Clearly you have some password I’m supposed to guess.
This post is not preliminary. It’s supposed to be interesting in itself. If it’s not, then I’m doing something wrong, and would appreciate constructive criticism.
Reinforcement, Preference and Utility
That’s an excellent point. Of course one cannot introduce RL without talking about the reward signal, and I’ve never intended to.
To me, however, the defining feature of RL is the structure of the solution space, described in this post. To you, it’s the existence of a reward signal. I’m not sure that debating this difference of opinion is the best use of our time at this point. I do hope to share my reasons in future posts, if only because they should be interesting in themselves.
As for your last point: RL is indeed a very general setting, and classical planning can easily be formulated in RL terms.
I explained this in my non-standard introduction to reinforcement learning.
We can define the world as having the Markov property, i.e. as a Markov process. But when we split the world into an agent and its environment, we lose the Markov property for each of them separately.
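A minimal illustration (my own toy example, not from the original discussion): a world that is perfectly Markov, indeed deterministic, whose observation process is nevertheless not Markov for the agent.

```python
# The world cycles deterministically 0 -> 1 -> 2 -> 0, so it is Markov.
# The agent only observes whether the state is 0 ('A') or not ('B').
states = [t % 3 for t in range(12)]
observations = ''.join('A' if s == 0 else 'B' for s in states)
print(observations)  # ABBABBABBABB

# The observation process is not Markov: after seeing 'B', the next
# observation depends on what preceded that 'B'.
#   ...A B  -> next is 'B'  (the cycle just left state 0)
#   ...B B  -> next is 'A'  (the cycle is about to return to state 0)
```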
I’m using non-standard notation and terminology because they are needed for the theory I’m developing in these posts. In future posts I’ll try to link more to the handful of researchers who do publish on this theory. I did publish one post relating the terminology I’m using to more standard research.