Sort of obvious but good to keep in mind: Metacognitive regret bounds are not easily reducible to “plain” IBRL regret bounds when we consider the core and the envelope as the “inside” of the agent.
Assume that the action and observation sets factor as $A = A_0 \times A_1$ and $O = O_0 \times O_1$, where $(A_0, O_0)$ is the interface with the external environment and $(A_1, O_1)$ is the interface with the envelope.
Let $\Lambda : \Pi \to \square(\Gamma \times (A \times O)^\omega)$ be a metalaw. Then, there are two natural ways to reduce it to an ordinary law:
Marginalizing over $\Gamma$. That is, let $\mathrm{pr}_{-\Gamma} : \Gamma \times (A \times O)^\omega \to (A \times O)^\omega$ and $\mathrm{pr}_0 : (A \times O)^\omega \to (A_0 \times O_0)^\omega$ be the projections. Then, we have the law $\Lambda^? := (\mathrm{pr}_0 \circ \mathrm{pr}_{-\Gamma})_* \circ \Lambda$.
Assuming “logical omniscience”. That is, let $\tau^* \in \Gamma$ be the ground truth. Then, we have the law $\Lambda^! := \mathrm{pr}_{0*}(\Lambda \mid \tau^*)$. Here, we use the conditional defined by $\Theta \mid A := \{\theta \mid A : \theta \in \operatorname{argmax}_{\theta \in \Theta} \Pr_\theta[A]\}$. It’s easy to see this indeed defines a law.
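To make these two constructions concrete, here is a minimal finite-support sketch in Python (the representation and toy numbers are hypothetical illustrations, not part of the formal setup): a hypothesis is a finitely supported distribution over pairs (element of $\Gamma$, history), a credal set is a finite list of such hypotheses, marginalization is a pushforward along a projection, and the conditional $\Theta \mid A$ keeps only the hypotheses maximizing $\Pr[A]$ and renormalizes them on $A$.

```python
# Minimal finite-support sketch of the two reductions (hypothetical toy representation).
# A hypothesis theta is a dict mapping outcomes to probabilities; a credal set is a
# finite list of hypotheses. Outcomes are pairs (gamma, history) standing in for
# Gamma x (A x O)^omega; the further restriction pr_0 to the external interface
# works the same way as the projection shown here.

from collections import defaultdict


def pushforward(theta, f):
    """Pushforward of a finitely supported distribution along the map f."""
    out = defaultdict(float)
    for outcome, p in theta.items():
        out[f(outcome)] += p
    return dict(out)


def marginalize_gamma(credal_set):
    """Lambda^?-style reduction: forget the Gamma component of every hypothesis."""
    return [pushforward(theta, lambda pair: pair[1]) for theta in credal_set]


def condition(credal_set, event):
    """Theta | A := { theta|A : theta in argmax_{theta in Theta} Pr_theta[A] }."""
    probs = [sum(p for o, p in theta.items() if o in event) for theta in credal_set]
    best = max(probs)
    conditioned = []
    for theta, pr_a in zip(credal_set, probs):
        if pr_a == best and pr_a > 0:
            conditioned.append({o: p / pr_a for o, p in theta.items() if o in event})
    return conditioned


# Toy example: Gamma = {"tau0", "tau1"}, histories = {"h0", "h1"}.
theta_a = {("tau0", "h0"): 0.5, ("tau1", "h1"): 0.5}
theta_b = {("tau0", "h0"): 0.2, ("tau0", "h1"): 0.8}
credal = [theta_a, theta_b]  # credal set assigned by the metalaw to some fixed policy

print(marginalize_gamma(credal))              # marginals over histories only
tau_star_event = {o for theta in credal for o in theta if o[0] == "tau0"}
print(condition(credal, tau_star_event))      # condition on ground truth tau* = "tau0"
```

On this toy data, conditioning on the ground-truth component keeps only the hypothesis that assigns it full probability, which is exactly the maximizer the definition asks for.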
However, requiring low regret w.r.t. either of these is not equivalent to requiring low regret w.r.t. $\Lambda$:
Learning $\Lambda^?$ is typically no less feasible than learning $\Lambda$; however, it is a much weaker condition. This is because metacognitive agents can use policies that query the envelope to get higher guaranteed expected utility (see the toy sketch after this list).
Learning $\Lambda^!$ is a much stronger condition than learning $\Lambda$; however, it is typically infeasible. Requiring it leads to AIXI-like agents.
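For some intuition on the first point, here is a toy numeric sketch (hypothetical payoffs, not taken from any formal result): with two candidate environments whose best actions disagree, every query-free deterministic policy has guaranteed utility $0$ (and mixing only gets $1/2$), while a policy allowed to ask a perfectly informative envelope which environment it is in guarantees $1$. Matching the best query-free policy, as learning $\Lambda^?$ asks, is therefore a much weaker benchmark.

```python
# Toy illustration (hypothetical payoffs) of why learning Lambda^? is the weaker bar:
# a policy that may query the envelope can guarantee strictly more expected utility
# than any policy that ignores it.

# Utility of each external action in each of two candidate environments.
utility = {
    "e0": {"a0": 1.0, "a1": 0.0},
    "e1": {"a0": 0.0, "a1": 1.0},
}
actions = ("a0", "a1")

# Query-free: guaranteed (worst-case over environments) utility of the best fixed action.
no_query = max(min(utility[env][a] for env in utility) for a in actions)

# With a (here, perfectly informative) envelope query revealing the environment,
# the agent best-responds in each environment, so the worst case is over best responses.
with_query = min(max(utility[env][a] for a in actions) for env in utility)

print(no_query, with_query)  # 0.0 vs 1.0: the query strictly raises the guarantee
```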
Therefore, metacognitive regret bounds hit a “sweet spot” of strength vs. feasibility which produces genuinely more powerful agents than IBRL[1].
More precisely, more powerful than IBRL with the usual sort of hypothesis classes (e.g. nicely structured crisp infra-RDPs). In principle, we can reduce metacognitive regret bounds to IBRL regret bounds using non-crisp laws, since there’s a very general theorem for representing desiderata as laws. But these laws would have a very peculiar form that seems impossible to guess without starting with metacognitive agents.