Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form
limγ→1Eμπγ[U(γ)]≥f(μ)
Here, f doesn’t have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side in linear in μ so any π that satisfies this will also satisfy it for the concave hull of f.
What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is
limγ→1Eμπγ[U(γ)]≥V(μ,γ)−f(μ)
But, it has the same form! Therefore we can consider it as a special case of the applying maximin (more precisely, it requires allowing the fuzzy belief to depend on γ, but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy π′0? Then the desideratum is
limγ→1Eμπγ[U(γ)]≥Eμπ′0[U(γ)]
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:
limγ→1Eμπγ[U(γ)]≥f(π,μ)
To achieve this, we postulate a predictor that guesses the policy, producing the guess ^π, and define the fuzzy belief using the function Eh∼μ[f(^π(h),μ)] (we assume the guess is not influenced by the agent’s actions so we don’t need π in the expected value). Using Nirvana trick, we effectively force the guess to be accurate.
In particular, this captures self-referential desiderata of the type “the policy cannot be improved by changing it in this particular way”. These are of the form:
limγ→1Eμπγ[U(γ)]≥EμF(π)[U(γ)]
It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting f(π,μ) to 1 for policies outside the space.
The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework the less information it contains, the less useful constraints it imposes. But, my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypotheses classes are learnable, and within what sample/computational complexity. So, although QBRL in itself doesn’t impose much restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.
Another explanation why maximin is a natural decision rule: when we apply maximin to fuzzy beliefs, the requirement to learn a particular class of fuzzy hypotheses is a very general way to formulate asymptotic performance desiderata for RL agents. So general that it seems to cover more or less anything you might want. Indeed, the definition directly leads to capturing any desideratum of the form
limγ→1Eμπγ[U(γ)]≥f(μ)
Here, f doesn’t have to be concave: the concavity condition in the definition of fuzzy beliefs is there because we can always assume it without loss of generality. This is because the left hand side in linear in μ so any π that satisfies this will also satisfy it for the concave hull of f.
What if instead of maximin we want to apply the minimax-regret decision rule? Then the desideratum is
limγ→1Eμπγ[U(γ)]≥V(μ,γ)−f(μ)
But, it has the same form! Therefore we can consider it as a special case of the applying maximin (more precisely, it requires allowing the fuzzy belief to depend on γ, but this is not a problem for the basics of the formalism).
What if we want our policy to be at least as good as some fixed policy π′0? Then the desideratum is
limγ→1Eμπγ[U(γ)]≥Eμπ′0[U(γ)]
It still has the same form!
Moreover, the predictor/Nirvana trick allows us to generalize this to desiderata of the form:
limγ→1Eμπγ[U(γ)]≥f(π,μ)
To achieve this, we postulate a predictor that guesses the policy, producing the guess ^π, and define the fuzzy belief using the function Eh∼μ[f(^π(h),μ)] (we assume the guess is not influenced by the agent’s actions so we don’t need π in the expected value). Using Nirvana trick, we effectively force the guess to be accurate.
In particular, this captures self-referential desiderata of the type “the policy cannot be improved by changing it in this particular way”. These are of the form:
limγ→1Eμπγ[U(γ)]≥EμF(π)[U(γ)]
It also allows us to effectively restrict the policy space (e.g. impose computational resource constraints) by setting f(π,μ) to 1 for policies outside the space.
The fact that quasi-Bayesian RL is so general can also be regarded as a drawback: the more general a framework the less information it contains, the less useful constraints it imposes. But, my perspective is that QBRL is the correct starting point, after which we need to start proving results about which fuzzy hypotheses classes are learnable, and within what sample/computational complexity. So, although QBRL in itself doesn’t impose much restrictions on what the agent should be, it provides the natural language in which desiderata should be formulated. In addition, we can already guess/postulate that an ideal rational agent should be a QBRL agent whose fuzzy prior is universal in some appropriate sense.