Here’s a sketch of an AIT toy-model theorem saying that, in complex environments without traps, applying selection pressure reliably produces learning agents. I view it as an example of Wentworth’s “selection theorem” concept.
Consider any environment μ of infinite Kolmogorov complexity (i.e. uncomputable). Fix a computable reward function
$$r : (A \times O)^* \to [0,1]$$
Suppose that there exists a policy π∗ of finite Kolmogorov complexity (i.e. computable) that’s optimal for μ in the slow discount limit. That is,
$$\lim_{\gamma \to 1} (1-\gamma)\left(\max_{\pi} \mathrm{E}_{\mu\pi}\!\left[\sum_{n=0}^{\infty} \gamma^n r_n\right] - \mathrm{E}_{\mu\pi^*}\!\left[\sum_{n=0}^{\infty} \gamma^n r_n\right]\right) = 0$$
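(As a quick numerical aside, here is a minimal Python sketch of what the (1−γ) normalization does. The periodic reward stream is purely an illustrative assumption, not part of the construction: as γ→1, the normalized discounted return approaches the long-run average reward, which is the quantity the optimality condition compares.)

```python
# Toy illustration: (1 - gamma) * sum_n gamma^n * r_n approaches the
# long-run average reward as gamma -> 1, for a periodic reward stream.
def normalized_return(rewards, gamma, horizon=200_000):
    total = 0.0
    for n in range(horizon):
        total += (gamma ** n) * rewards[n % len(rewards)]
    return (1 - gamma) * total

cycle = [0.0, 1.0, 1.0]  # hypothetical reward cycle; long-run average = 2/3
for gamma in (0.9, 0.99, 0.999, 0.9999):
    print(gamma, normalized_return(cycle, gamma))
# The printed values approach 2/3 as gamma -> 1.
```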
Then, μ cannot be the only environment with this property. Otherwise, this property could be used to define μ using a finite number of bits, which is impossible[1]. Since μ requires infinitely many more bits to specify than π∗ and r, there have to be infinitely many environments with the same property[2]. Therefore, π∗ is a reinforcement learning algorithm for some infinite class of hypotheses.
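To make the counting step explicit (a sketch under the stated assumptions): if μ were the unique environment for which the computable pair (π∗, r) has this optimality property, then a program for π∗, a program for r, and a constant-length description of the property would jointly single out μ, giving

$$K(\mu) \le K(\pi^*) + K(r) + O(1) < \infty,$$

contradicting the assumption that μ has infinite Kolmogorov complexity.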
Moreover, there are natural examples of μ as above. For instance, let’s construct μ via an infinite sequence of finite communicating infra-RDP refinements that converges to an unambiguous (i.e. “not infra”) environment. Since each refinement involves some arbitrary choice, “most” such μ have infinite Kolmogorov complexity. In this case, π∗ exists: it can be any learning algorithm for finite communicating infra-RDPs with an arbitrary number of states.
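For intuition about the conclusion, here is a deliberately much simpler stand-in for the infra-RDP construction (an illustrative assumption, not the construction above): deterministic k-armed bandits whose fixed arm rewards may be arbitrary, even uncomputable, numbers in [0,1]. A single policy of finite description length is a learning algorithm that attains the optimal asymptotic reward in every member of this infinite class, even though “most” members have infinite Kolmogorov complexity.

```python
# Minimal sketch: a finite-complexity policy that is asymptotically optimal
# across an infinite class of environments -- deterministic k-armed bandits
# with arbitrary fixed arm rewards in [0, 1].

def explore_then_exploit(k, pull):
    """Generator policy: pull each arm once, then commit to the best arm.

    `pull(arm)` plays the role of the environment and returns that arm's
    fixed reward. The policy's description is finite and independent of the
    environment, yet its long-run average reward matches the optimum for
    every environment in the class.
    """
    estimates = [pull(arm) for arm in range(k)]      # one exploration round
    best = max(range(k), key=lambda arm: estimates[arm])
    while True:
        yield best, pull(best)                        # exploit forever

# Hypothetical example environment: three arms with fixed rewards.
arm_rewards = [0.2, 0.9, 0.5]
policy = explore_then_exploit(len(arm_rewards), lambda arm: arm_rewards[arm])
print([next(policy) for _ in range(3)])
# -> [(1, 0.9), (1, 0.9), (1, 0.9)]: after the exploration round the policy
#    locks onto the best arm, so its average reward tends to max(arm_rewards).
```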
Besides making this a rigorous theorem, there are many additional questions for further investigation:
Can we make similar claims that incorporate computational complexity bounds? It seems that it should be possible to at least constrain our algorithms to be PSPACE in some sense, but it’s not obvious how to go beyond that (maybe it would require the frugal universal prior).
Can we argue that π∗ must be an infra-Bayesian learning algorithm? Relatedly, can we make a variant where computable/space-bounded policies can only attain some part of the optimal asymptotic reward of μ?
The setting we described requires that all the traps in μ can be described in a finite number of bits. If this is not the case, can we make a similar sort of argument that implies π∗ is Bayes-optimal for some prior over a large hypothesis class?
[1] Probably, making this argument rigorous requires replacing the limit with a particular regret bound. I ignore this for the sake of simplifying the core idea.
[2] There probably is something more precise that can be said about how “large” this family of environments is. For example, maybe it must be uncountable.