But in Newcomb’s problem, the agent’s reward in case of wrong prediction is already defined. For example, if the agent one-boxes but the predictor predicted two-boxing, the reward should be zero. If you change that to +infinity, aren’t you open to the charge of formalizing the wrong problem?
The point is, if you put this “quasi-Bayesian” agent into an iterated Newcomb-like problem, it will learn to get the maximal reward (i.e. the reward associated with FDT). So, judging it from the outside, you will have to concede that it behaves rationally, regardless of its internal representation of reality.
Philosophically, my point of view is that it is an error to think counterfactuals have objective, observer-independent meaning. Instead, we can talk about some sort of consistency conditions between the different points of view. From the agent’s point of view, it would reach Nirvana if it dodged the predictor. From Omega’s point of view, if Omega predicted two-boxing and the agent one-boxed, the agent’s reward would be zero (and the agent would learn its beliefs were wrong). From a third-person point of view, the counterfactual “Omega makes an error of prediction” is ill-defined: it conditions on an event of probability 0.
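To make the maximin computation concrete, here is a toy sketch (the payoff numbers and the use of an infinite float as a stand-in for Nirvana are just illustrative choices):

```python
# Toy maximin for Newcomb under the incomplete hypothesis "the predictor is accurate".
# Branches where the prediction differs from the action are replaced by Nirvana
# (represented as +infinity), so they never determine the worst case.
NIRVANA = float("inf")

# reward[action][prediction], in dollars
reward = {
    "one-box": {"one-box": 1_000_000, "two-box": NIRVANA},
    "two-box": {"two-box": 1_000,     "one-box": NIRVANA},
}

def maximin_action(reward):
    # Worst case for each action: the Nirvana branches drop out of the minimum,
    # leaving only the accurate-prediction branch.
    worst_case = {a: min(branches.values()) for a, branches in reward.items()}
    return max(worst_case, key=worst_case.get), worst_case

best, worst_case = maximin_action(reward)
print(worst_case)  # {'one-box': 1000000, 'two-box': 1000}
print(best)        # one-box
```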
Yeah, I think I can make peace with that. Another way to think of it is that we can keep the reward structure of the original Newcomb’s problem, but instead of saying “Omega is almost always right” we add another person, Bob (maybe the mad scientist who built Omega), who’s willing to pay you a billion dollars if you prove Omega wrong. Then maximin indeed leads to one-boxing. Though I guess the remaining question is why maximin is the right thing to do. And if randomizing is allowed, the idea of Omega predicting how you’ll randomize seems a bit dodgy as well.
Another explanation of why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. It is so general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form
$$\lim_{\gamma\to 1} \mathbb{E}_{\mu}^{\pi_\gamma}[U(\gamma)] \;\ge\; f(\mu)$$
Here, $f$ doesn’t have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left-hand side is linear in $\mu$, so any $\pi$ that satisfies this will also satisfy it for the concave hull of $f$.
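Spelling that step out: for any convex decomposition $\mu = \sum_i \alpha_i \mu_i$,

$$\lim_{\gamma\to 1} \mathbb{E}_{\mu}^{\pi_\gamma}[U(\gamma)] \;=\; \sum_i \alpha_i \lim_{\gamma\to 1} \mathbb{E}_{\mu_i}^{\pi_\gamma}[U(\gamma)] \;\ge\; \sum_i \alpha_i f(\mu_i),$$

and taking the supremum over all such decompositions bounds the left-hand side from below by the concave hull of $f$ at $\mu$.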
What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is
$$\lim_{\gamma\to 1} \mathbb{E}_{\mu}^{\pi_\gamma}[U(\gamma)] \;\ge\; V(\mu,\gamma) - f(\mu)$$
But it has the same form! Therefore we can consider it as a special case of applying maximin (more precisely, it requires allowing the fuzzy belief to depend on $\gamma$, but this is not a problem for the basics of the formalism).
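Explicitly, this is the same template with the $\gamma$-dependent bound

$$f_\gamma(\mu) \;:=\; V(\mu,\gamma) - f(\mu).$$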
What if we want our policy to be at least as good as some fixed policy $\pi'_0$? Then the desideratum is
$$\lim_{\gamma\to 1} \mathbb{E}_{\mu}^{\pi_\gamma}[U(\gamma)] \;\ge\; \mathbb{E}_{\mu}^{\pi'_0}[U(\gamma)]$$
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:
$$\lim_{\gamma\to 1} \mathbb{E}_{\mu}^{\pi_\gamma}[U(\gamma)] \;\ge\; f(\pi,\mu)$$
To achieve this, we postulate a predictor that guesses the policy, producing the guess $\hat{\pi}$, and define the fuzzy belief using the function $\mathbb{E}_{h\sim\mu}[f(\hat{\pi}(h),\mu)]$ (we assume the guess is not influenced by the agent’s actions, so we don’t need $\pi$ in the expected value). Using the Nirvana trick, we effectively force the guess to be accurate.
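That is, once inaccurate guesses lead to Nirvana, the only branch that matters is the one where $\hat{\pi}(h) = \pi$ for $\mu$-almost-all $h$, on which

$$\mathbb{E}_{h\sim\mu}[f(\hat{\pi}(h),\mu)] \;=\; f(\pi,\mu),$$

recovering the desideratum above.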
In particular, this captures self-referential desiderata of the type “the policy cannot be improved by changing it in this particular way”. These are of the form:
$$\lim_{\gamma\to 1} \mathbb{E}_{\mu}^{\pi_\gamma}[U(\gamma)] \;\ge\; \mathbb{E}_{\mu}^{F(\pi)}[U(\gamma)]$$
It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting $f(\pi,\mu)$ to 1 for policies outside the space.
The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework is, the less information it contains and the fewer useful constraints it imposes. But my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypothesis classes are learnable, and with what sample/computational complexity. So, although QBRL in itself doesn’t impose many restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.
Well, I think that maximin is the right thing to do because it leads to reasonable guarantees for quasi-Bayesian reinforcement learning agents. I think of incomplete models as properties that the environment might satisfy. It is necessary to speak of properties instead of complete models since the environment might be too complex to understand in full (for example because it contains Omega, but also for more prosaic reasons), but we can hope it at least has properties/patterns the agent can understand. A quasi-Bayesian agent has the guarantee that, whenever the environment satisfies one of the properties in its prior, the expected utility will converge at least to the maximin for this property. In other words, such an agent is able to exploit any true property of the environment it can understand. Maybe a more “philosophical” defense of maximin is possible, analogous to VNM / complete class theorems, but I don’t know (I actually saw some papers in that vein but haven’t read them in detail).
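Schematically, writing $\mu^*$ for the true environment and $\Phi$ for an incomplete hypothesis in the prior that $\mu^*$ satisfies, the guarantee is of the form

$$\liminf_{\gamma\to 1}\left(\mathbb{E}_{\mu^*}^{\pi_\gamma}[U(\gamma)] \;-\; \max_{\pi}\min_{\mu\in\Phi}\mathbb{E}_{\mu}^{\pi}[U(\gamma)]\right) \;\ge\; 0.$$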
If the agent has random bits that Omega doesn’t see, and Omega is predicting the probabilities of the agent’s actions, then I think we can still solve it with quasi-Bayesian agents, but it requires considering more complicated models and I haven’t worked out the details. Specifically, I think that we can define some function $X$ that depends on the agent’s actions and Omega’s predictions so far (a measure of Omega’s apparent inaccuracy), s.t. if Omega is an accurate predictor, then the supremum of $X$ over time is finite with probability 1. Then we consider a family of models, where model number $n$ says that $X < n$ for all times. Since at least one of these models is true, the agent will learn it and will converge to behaving appropriately.
EDIT 1: I think $X$ should be something like: how much money a gambler following a particular strategy would win betting against Omega.
EDIT 2: Here is the solution. In the case of the original Newcomb problem, consider a gambler that bets against Omega on the agent one-boxing. Every time the agent two-boxes, the gambler loses 1 dollar. Every time the agent one-boxes, the gambler wins $\frac{1}{p} - 1$ dollars, where $p$ is the probability Omega assigned to one-boxing. Now it’s possible to see that one-boxing guarantees the “CC” payoff under the corresponding model (in the $\gamma \to 1$ limit): if the agent one-boxes, the gambler keeps winning unless Omega converges to predicting one-boxing rapidly enough. In the case of a general Newcomb-like problem, just replace “one-boxes” by “follows the FDT strategy”.
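Here is a toy simulation of that bookkeeping (the two Omega behaviours below are illustrative choices, not part of the construction):

```python
def gambler_wealth(actions, predictions):
    """Winnings of a gambler who bets against Omega on the agent one-boxing.

    actions:     list of agent actions, "one-box" or "two-box"
    predictions: list of probabilities p that Omega assigned to one-boxing
    """
    wealth, history = 0.0, []
    for a, p in zip(actions, predictions):
        if a == "one-box":
            wealth += 1.0 / p - 1.0  # wins 1/p - 1 dollars
        else:
            wealth -= 1.0            # loses 1 dollar
        history.append(wealth)
    return history

T = 1000
actions = ["one-box"] * T  # the agent always one-boxes

# An accurate Omega converges to predicting one-boxing quickly,
# so the gambler's winnings stay bounded (about 1.6 dollars here)...
accurate = [1.0 - 0.5 ** (t + 1) for t in range(T)]
# ...whereas an Omega stuck at p = 0.5 lets the winnings grow without bound,
# eventually falsifying the model "X < n at all times" for every n.
stuck = [0.5] * T

print(max(gambler_wealth(actions, accurate)))  # ~1.6
print(max(gambler_wealth(actions, stuck)))     # 1000.0
```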
I agree that you can assign whatever belief you want (e.g. whatever is useful for the agent’s decision-making process) for what happens in the counterfactual where Omega is wrong, in decision problems where Omega is assumed to be a perfect predictor. However, if you want to generalise to cases where Omega is an imperfect predictor (as you do mention), then I think you will (in general) have to put in the correct reward for Omega being wrong, because this is something that might actually be observed.
The method should work for imperfect predictors as well. In the simplest case, the agent can model the imperfect predictor as perfect predictor + random noise. So, it definitely knows the correct reward for Omega being wrong. It still believes in Nirvana if “idealized Omega” is wrong.
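For concreteness, one simple way to set this up (the flip probability $\varepsilon$ is an illustrative choice, not part of the claim): the observed prediction is

$$a^{\mathrm{obs}} \;=\; \begin{cases} \hat{a} & \text{with probability } 1-\varepsilon, \\ \text{the other action} & \text{with probability } \varepsilon, \end{cases}$$

where $\hat{a}$ is the idealized prediction; the ordinary Newcomb rewards are computed from $a^{\mathrm{obs}}$, while Nirvana is attached to $\hat{a}$ differing from the agent’s actual action.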