It’s obvious that you intend this as requiring research, including making good conceptual choices, rather than having a fixed answer. However, I’m going to speak from my current understanding of predictive processing.
I’m quite interested in your (John’s) take on how the following differs from what you had in mind.
I believe there are several possible answers based on different ways of using predictive-processing-associated ideas.
A. Soft-max decision-making.
One thing I’ve seen in a presentation on this stuff is the claim of a close connection between probability and utility, namely u=log(p).
This relates to a very common approximate model of bounded rationality: you introduce some randomness, but make worse mistakes less probable, by making actions exponentially more probable as their utility goes up. The level of rationality can be controlled by a “temperature” parameter—higher temperature means more randomness, lower temperature means closer to just always taking the max.
The u=log(p) idea takes that “approximation” as definitional; action probabilities are revealed preferences, from which we can find utilities by taking logarithms.
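For concreteness, here is a minimal sketch of the soft-max model in Python. The utilities, the temperature values, and the three-action setup are made-up numbers for illustration, not anything taken from the predictive-processing literature.

```python
import numpy as np

# Toy utilities for three actions (illustrative numbers only).
utilities = np.array([1.0, 2.0, 4.0])

def softmax_policy(u, temperature=1.0):
    """Action probabilities exponential in utility: p(a) proportional to exp(u(a)/T)."""
    logits = u / temperature
    logits = logits - logits.max()   # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

p = softmax_policy(utilities, temperature=1.0)
print(p)  # roughly [0.042, 0.114, 0.844]

# Reading the relation the other way: treating action probabilities as revealed
# preferences, u = T*log(p) recovers the utilities up to an additive constant
# (the log of the normalizer).
recovered = 1.0 * np.log(p)
print(recovered - recovered.max() + utilities.max())  # matches `utilities`

# Lower temperature: closer to always taking the argmax.
# Higher temperature: closer to uniform randomness.
print(softmax_policy(utilities, temperature=0.1))
print(softmax_policy(utilities, temperature=10.0))
```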
The randomness can be interpreted as exploration. I don’t personally see that interpretation as very good, since this form of randomness does not vary based on model uncertainty, but there may be justifications I’m not aware of.
The stronger attempt to justify the randomness, in my book, is based on Monte Carlo inference. However, that’s better discussed under the next heading.
B. Sampling from wishful thinking.
If you were to construct an agent by the formula from option (A), you would first define the agent’s beliefs and desires in the usual Bayesian way. You’d then calculate expected utilities for events in the normal way. You only depart from standard Bayesian decision-making at the last step, where you randomize rather than just taking the best action.
The implicit promise of the u=log(p) formula is to provide a deeper unification of belief and value than that, and correspondingly, a deeper restructuring of decision theory.
One commonly discussed proposal is as follows: condition on success, then sample from the resulting distribution on actions. (You don’t necessarily have a binary notion of “success” if you attach real-valued utilities to the various outcomes, but there is a generalization where we condition on “utility being high” without exactly specifying how high it is. This will involve the same “temperature” parameter mentioned earlier.)
The technical name for this idea is “planning by inference”, because we can use algorithms for Monte Carlo inference to sample actions. We’re using inference algorithms to plan! That’s a useful unification of utility and probability: machinery previously used for one purpose, is now used for both.
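As a sketch of what “using inference algorithms to plan” looks like in the simplest possible case, here is planning-by-inference done with the crudest Monte Carlo method, rejection sampling: sample an action from the prior, simulate an outcome, and keep the sample only if “success” occurred. The prior and the per-action success probabilities are invented for illustration.

```python
import random

# Invented toy problem: a prior over actions and a success probability per action.
prior = {"a": 0.7, "b": 0.2, "c": 0.1}
p_success = {"a": 0.1, "b": 0.5, "c": 0.9}

def sample_action_given_success(n_samples=100_000):
    """Rejection sampling from P(action | success)."""
    counts = {a: 0 for a in prior}
    kept = 0
    for _ in range(n_samples):
        a = random.choices(list(prior), weights=list(prior.values()))[0]  # sample from the prior
        if random.random() < p_success[a]:                                # simulate the outcome
            counts[a] += 1                                                # keep only the successes
            kept += 1
    return {a: c / kept for a, c in counts.items()}

# Exact answer: P(a | success) is proportional to prior(a) * P(success | a),
# i.e. roughly {"a": 0.27, "b": 0.38, "c": 0.35}.
print(sample_action_given_success())
```

Note how the a priori probable action “a” still gets a lot of posterior weight despite being the worst plan; that is the bias discussed below.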
It also kinda captures the intuition you mentioned, about restricting our world-model to assume some stuff we want to be true:
Abstracting out the key idea: we pack all of the complicated stuff into our world-model, hardcode some things into our world-model which we want to be true, then generally try to make the model match reality.
However, planning-by-inference can cause us to take some pretty dumb-looking actions.
For example, let’s say that we need $200 for rent money. For simplicity, we have binary success/failure: either we get the money we need, or not. We have $25 which we can use to gamble, for a 1/16th chance of making the $200 we need. Alternately, we happen to know tomorrow’s winning lotto numbers, which we can enter in for a 100% chance of getting the money we need.
However, taking random actions, let’s say there is only a one-in-a-million chance of entering the winning lotto numbers.
Conditioning on our success, it’s much more probable that we gamble with our $25 and get the money we need that way.
So planning-by-inference is heavily biased toward plans of action which are not too improbable in the prior before conditioning on success.
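To make the bias concrete, here is the arithmetic for the rent example. The 1/16 and one-in-a-million figures are the ones from above; the 50% prior on gambling (with the remainder on “do something else that fails”) is my own illustrative assumption.

```python
# Prior over plans (the 0.5 on "gamble" is an illustrative assumption).
prior = {"gamble": 0.5, "enter_winning_numbers": 1e-6, "other": 0.5 - 1e-6}

# Probability of ending up with the $200 under each plan.
p_success = {"gamble": 1 / 16, "enter_winning_numbers": 1.0, "other": 0.0}

# Planning by inference: P(plan | success) is proportional to prior * likelihood.
joint = {plan: prior[plan] * p_success[plan] for plan in prior}
z = sum(joint.values())
posterior = {plan: p / z for plan, p in joint.items()}
print(posterior)
# roughly {'gamble': 0.99997, 'enter_winning_numbers': 3.2e-05, 'other': 0.0}
```

Even though entering the winning numbers succeeds with certainty, conditioning on success leaves almost all of the probability on gambling, because gambling is so much more probable in the prior.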
On the other hand, the temperature parameter can help us out here. Adjusting the temperature looks kind of like “conditioning on success multiple times”—i.e., it’s as if you took the new distribution on actions as the prior, and then conditioned again to further bias things in the direction of success.
This has a somewhat nice justification in terms of Monte Carlo algorithms. For some algorithms, this “temperature” ends up being an indication of how long you took to think. There’s a bias toward actions with high prior probabilities because, due to the randomness of the search, that’s effectively where you look first when planning.
This sounds like a nice account of bounded rationality: the randomness in the u=log(p) model is due to the boundedness of our search, and the fact that we may or may not find the good solutions in the time available.
Except for one major problem: this kind of random search isn’t what humans, or AIs, do in general. Even within the realm of Monte Carlo algorithms, there are a lot of optimizations one can add which would destroy the u=log(p) relationship. I don’t currently know of any reason to suppose that there’s some nice generalization which holds for computationally efficient minds.
So ultimately, I would say that there is a sorta nice theory of bounded rationality here, but not a very nice one.
Except… I actually know a way to address the concern about bias toward a priori probable actions, while sticking to the planning-by-inference picture, and also using an arguably much better theory of bounded rationality.
C. Logical Induction Decision Theory
As Scott discussed in a recent talk, if you try the planning-by-inference trick with a logical inductor as your inferencer, you maximize expected utility anyway:
This algorithm predicts what it did conditional on having won, and then copies that distribution. It just says, “output whatever I predict that I output conditioned on my having won”.
[...]
But it turns out that you do reach the same endpoint, because the only fixed point of this process is going to do the same as the last algorithm’s. So this algorithm turns out to be functionally the same as the previous one.
One way of understanding what’s happening is this: in the planning-by-inference picture, we start with a prior, and condition on success, then sample actions. This creates a bias toward a priori probable actions, which can result in the irrational behavior I mentioned earlier.
In the context of logical induction, however, we additionally stipulate that the a priori distribution on actions and the updated distribution must match. This has the effect of “updating on success an infinite number of times” (in the sense that I mentioned earlier, where lowering the temperature is kind of like “updating on success again”).
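A quick way to see the “updating on success an infinite number of times” point: each extra round of conditioning multiplies the distribution by another factor of P(success | plan), so iterating pushes all the mass onto the plan with the highest success probability. A sketch, reusing the illustrative prior from the rent example above:

```python
prior = {"gamble": 0.5, "enter_winning_numbers": 1e-6, "other": 0.5 - 1e-6}
p_success = {"gamble": 1 / 16, "enter_winning_numbers": 1.0, "other": 0.0}

def condition_on_success(dist):
    """One round of wishful thinking: reweight each plan by its success probability."""
    joint = {plan: dist[plan] * p_success[plan] for plan in dist}
    z = sum(joint.values())
    return {plan: p / z for plan, p in joint.items()}

dist = dict(prior)
for step in range(12):
    dist = condition_on_success(dist)
    if step in (0, 3, 11):
        print(step + 1, {plan: round(p, 6) for plan, p in dist.items()})

# After 1 update, "gamble" still dominates; after 4 it is down to about 88%;
# after 12 essentially all of the mass sits on the plan that actually guarantees
# success. The fixed point (prior = posterior) that logical induction enforces
# is exactly this limit.
```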
Furthermore, unlike the Monte Carlo algorithms mentioned earlier, logical induction is a theoretically very well-founded theory of bounded rationality. Not so bounded you’d want to run it on an actual computer, granted. But at least it addresses the question of what kind of optimality we can enforce on bounded reasoning, rather than just positing a particular kind of computation as the answer.
Since this is equivalent to regular expected utility maximization with logical inductors, there’s no reason to use planning-by-inference, but there’s also no reason not to.
So, what kind of decision theory does this get us?
Cooperate in Prisoner’s Dilemma with agents whose pseudorandom moves exactly match, or sufficiently correlate with, our own. Defect against agents with uncorrelated pseudorandom exploration sequences (even if they otherwise have “the same mental architecture”). So cooperation is pretty difficult.
One-box in Newcomb with a perfect predictor. Two-box if the predictor is imperfect. This holds even if the predictor is extremely accurate (say 99.9% accurate), so long as the agent knows more about its own move than the predictor—the only way the agent will one-box is if the predictor’s prediction contains information about the agent’s own action which the agent does not possess at the time of choosing.
Fail transparent Newcomb.
Fail counterfactual mugging.
Fail Parfit’s Hitchhiker.
Fail at agent-simulates-predictor.
This was a solid explanation, thanks.
Some differences from what I imagine...
First and foremost, I imagine that the notion of “success” on which the agent conditions is not just a direct translation of “winning” in the decision problem. After all, a lot of the substance of tricky decision theory problems is exactly in that “direct” translation of what-it-means-to-win! Instead, I imagine that the notion of “success” has a lot more supporting infrastructure built into it, and the agent’s actions can directly interact with the supporting infrastructure as well as the nominal goal itself.
A prototypical example here would be an abstraction-based decision theory. There, the notion of “success” would not be “system achieves the maximum amount of utility”, but rather “system abstracts into a utility-maximizing agent”. The system’s “choices” will be used both to maximize utility and to make sure the abstraction holds. The “supporting infrastructure” part—i.e. making sure the abstraction holds—is what would handle things like e.g. acting as though the agent is deciding for simulations of itself (see the link for more explanation of that).
More generally, two other notions of “success” which we could imagine:
“success” means “our model of the territory is accurate, and our modelled-choices maximize our modelled-utility” (though this allows some degrees of freedom in how the model handles counterfactuals)
“success” means “the physical process which output our choices is equivalent to program X” (where X itself would optimize for this notion of success, and probably some other conditions as well; the point here is to check that the computation is not corrupted)
(These are not mutually exclusive.) In both cases, the agent’s decisions would be used to support its internal infrastructure (accurate models, uncorrupted computation) as well as the actual utility-maximization.
Having written that all out, it seems like it might be orthogonal to predictive processing. I had been thinking of these “success” notions more as part-of-the-world-model, mainly because the “success” notions are largely about parts of the world abstracting into specific things (models, program execution, agents). In that context, it made sense to view “enforcing the infrastructure” as part of “making the model and the territory match”. But if abstraction-enforcement is built into the utility function, rather than the model, then it looks less predictive-processing-specific.
Isn’t this kind of like virtue ethics as opposed to utilitarianism?
Interesting analogy, I hadn’t thought of that.
I don’t buy the lottery example. You never encoded the fact that you know tomorrow’s numbers. Shouldn’t the prior be that you win a million guaranteed if you buy the ticket?
No! You also have to enter the right numbers.
What I’m doing is modeling “gamble with the money” as a simple action—you can imagine there’s a big red button that gives you $200 1/16th of the time and takes all your money otherwise.
And then I’m modeling “buy a lotto ticket” as a compound action consisting of entering each number individually.
“Knowing the numbers” means your world model understands that if you’ve entered the right numbers, you get the money. But it doesn’t make “enter the right numbers” probable in the prior.
Of course the conclusion is reversed if we make “enter the right numbers” into a primitive action.
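For concreteness, a sketch of the kind of prior I have in mind. Treating the lotto entry as six digits with a uniform prior over each digit is my own illustrative assumption, chosen to roughly match the one-in-a-million figure above.

```python
# Compound action: entering a lotto number one digit at a time.
# The world model knows which sequence wins, but the prior over behavior
# doesn't privilege it: each digit is (roughly) uniform over 0-9.
p_per_digit = 1 / 10
n_digits = 6
p_enter_winning_numbers = p_per_digit ** n_digits
print(p_enter_winning_numbers)       # roughly 1e-06, the one-in-a-million figure

# Versus the "big red button" gamble, modeled as a single primitive action:
p_gamble = 0.5                       # illustrative prior on pressing the button
print(p_gamble * (1 / 16))           # 0.03125 = prior * P(success | gamble)
print(p_enter_winning_numbers * 1.0) # 1e-06   = prior * P(success | lotto)
# Conditioning on success therefore still overwhelmingly favors the gamble.
```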
I also didn’t understand that. I was thinking of it more like AlphaStar in the sense that your prior is that you’re going to continue using your current (probabilistic) policy for all the steps involved in what you’re thinking about.
(But not like AlphaStar in that the brain is more likely to do one-or-a-few-steps of rollout with clever hierarchical abstract representations of plans, rather than dozens-of-steps rollouts in a simple one-step-at-a-time way.)
See my answer to Gurkenglas.
My understanding of planning by inference (aka active inference?) is not so much like AlphaStar. More to say here, but I’m out of time atm.