Another remark about Turing reinforcement learning (I renamed it so because it’s a reinforcement learner “plugged” into a universal Turing machine, and it is also related to Neural Turing machines). Here is how we can realize Abram’s idea that “we can learn a whole lot by reasoning before we even observe that situation”.
Imagine that the environment satisfies some hypothesis H1 that involves the evaluation of a program P1, and that P1 is relatively computationally expensive but not prohibitively so. If H1 is a simple hypothesis, and in particular P1 is a short program, then we can expect the agent to learn H1 from a relatively small number of samples. However, because of the computational cost of P1, exploiting H1 might be difficult (there is a large latency between the time the agent knows it needs to evaluate P1 and the time it gets the answer). Now assume that P1 is actually equivalent (as a function) to a different program P2, which is long but relatively cheap. Then we can use P2 to describe a hypothesis H2 which is also true and easily exploitable (i.e. it guarantees a higher payoff than H1), but which is complex (because P2 is long). The agent would need a very large number of (physical) samples to learn H2 directly. But, if the agent has plenty of computational resources for use during its “free time”, it can use them to learn the (correct but complex) hypothesis M := "H1 = H2". Then the conjunction of H1 and M guarantees the same high payoff as H2. Thus, TRL converges to exploiting the environment well by virtue of “reasoning before observing”.
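To make the story concrete, here is a minimal toy sketch (not TRL itself; all names in it are illustrative): P1 is a short but expensive program, P2 is a longer but cheap program computing the same function, and the agent spends “free time” compute checking the conjecture that they agree before any online decision depends on the answer.

```python
import random

# Toy illustration of the idea (not TRL itself); all names here are made up.
# P1: a short but computationally expensive program.
# P2: a longer but cheap program that is extensionally equal to P1.

def P1(n):
    # Naive exponential-time Fibonacci: short source, expensive to run.
    if n < 2:
        return n
    return P1(n - 1) + P1(n - 2)

def P2(n):
    # Iterative Fibonacci: more code, but linear time.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def conjecture_holds(trials=20, max_n=25):
    # "Mathematical samples": during free time, spend spare compute testing
    # the conjecture that P1 and P2 agree, before any decision depends on it.
    return all(P1(k) == P2(k) for k in (random.randrange(max_n) for _ in range(trials)))

# At decision time, the agent's simple, cheaply learned hypothesis is phrased
# via P1, but if the equivalence conjecture survived testing it can answer
# queries with the cheap P2 instead, avoiding the latency of running P1.
evaluate = P2 if conjecture_holds() else P1
print(evaluate(30))
```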
Moreover, we can plausibly translate this story into a rigorous desideratum along the following lines: for any conjunction of a “purely mathematical” hypothesis M and an arbitrary (“physical”) hypothesis H, we can learn (= exploit) it given enough “mathematical” samples relative to the complexity (prior probability) of M and enough “physical” samples relative to the complexity of H.
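For instance (the notation below is my own placeholder shorthand, not anything from the original remark): writing $\zeta(M)$ and $\xi(H)$ for the prior probabilities of the two hypotheses and $N_{\mathrm{math}}$, $N_{\mathrm{phys}}$ for the numbers of mathematical and physical samples, the desideratum might take the schematic shape

$$\mathrm{Regret}_T(M \wedge H) \;\le\; f\big(N_{\mathrm{math}},\, -\ln \zeta(M)\big) \;+\; g\big(N_{\mathrm{phys}},\, -\ln \xi(H)\big),$$

where $f$ and $g$ go to zero once $N_{\mathrm{math}}$ is large relative to $-\ln \zeta(M)$ and $N_{\mathrm{phys}}$ is large relative to $-\ln \xi(H)$.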