Regarding regret bounds: I don’t think they are realistic for an AGI, unless it queries an optimal teacher for every action (which would make it useless). In the real world, no actions are recoverable, and any time it picks an action on its own, we cannot be sure it is acting optimally.
First, you can have a subjective regret bound which doesn’t require all actions to be recoverable (it does require some actions to be approximately recoverable, which is indeed the case in the real world).
Second, dealing rationally with non-recoverable actions should still translate into mathematical conditions, some of which might still look like a sort of regret bound, and in any case finite MDPs are a natural starting point for analyzing them.
Third, checking regret bounds for priors in which all actions are recoverable serves as a sanity test for candidate AGI algorithms. It is not a sufficient desideratum, but I do think it is necessary.
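For concreteness, the kind of bound such a sanity test would check is the standard (objective) regret for a finite MDP; this is the textbook definition, not anything specific to this discussion:

$$\mathrm{Reg}(T) \;=\; T\rho^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],$$

where $\rho^{*}$ is the optimal average reward of the true MDP and $r_t$ is the reward the agent actually collects. A bound of the form $\mathrm{Reg}(T) = o(T)$ says the agent’s time-averaged performance converges to optimal, and recoverability of actions is what makes sublinear bounds achievable at all: the agent must be able to undo the consequences of exploratory mistakes.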
But I think many of the difficulties with general intelligence are not captured in the simple setting.
I agree that some of the difficulties are not captured. I am curious: do you have more concrete examples in mind than what you wrote in the post?
I don’t quite know what to think of continuous MDPs. I’ll wildly and informally conjecture that if the state space is compact, and if the transitions are Lipschitz continuous with respect to the state, it’s not a whole lot more powerful than the finite-state MDP formalism.
This seems wrong to me. Can you elaborate on what you mean by “powerful” in this context? Continuous MDPs definitely describe a large variety of environments that cannot be captured by a finite-state MDP, at least not without approximations. Solving continuous MDPs can also be much more difficult than solving finite-state MDPs. For example, any POMDP can be made into a continuous MDP by treating beliefs as states, and finding the optimal policy for a POMDP is PSPACE-hard (as opposed to finite-state MDPs, which can be solved in polynomial time).
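To spell out the reduction: given a POMDP with hidden states $s$, transition kernel $T$, and observation kernel $O$, the belief MDP’s states are distributions $b \in \Delta(S)$, updated by standard Bayesian filtering (notation mine):

$$b'(s') \;=\; \frac{O(o \mid s', a)\,\sum_{s} T(s' \mid s, a)\, b(s)}{\Pr(o \mid b, a)}.$$

Even when $S$ is finite, $b$ ranges over a continuous simplex, so the belief MDP is genuinely continuous, and the hardness of POMDP planning transfers to this class of continuous MDPs.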
But the upshot of those MDP techniques is mainly to not search through the same plans twice, and if we have an advanced agent that is managing to not evaluate many plans even once, I think there’s a good chance that we’ll get the don’t-evaluate-plans-twice behavior for free.
I guess you might be thinking exclusively of algorithms that have something like a uniform prior over transition kernels. In this case there is obviously no way to learn about a state without visiting it. But we can also consider algorithms with more sophisticated priors and get much faster learning rates (if the environment is truly sampled from this prior, of course). The best example is, I think, the work of Osband and Van Roy, where a regret bound is derived that scales with a certain dimension parameter of the hypothesis space (which can be much smaller than the number of states and actions), work on which I have continued to build.
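The algorithmic idea behind that line of work is posterior sampling (PSRL): each episode, sample an MDP from the posterior, act optimally for the sample, and update. Below is a minimal sketch for the tabular episodic case with Bernoulli rewards; the `env_reset`/`env_step` interface is hypothetical, and this toy version uses Dirichlet and Beta priors rather than the richer hypothesis classes the dimension-dependent bounds are about:

```python
import numpy as np

def psrl(env_reset, env_step, S, A, H, episodes, rng=None):
    """Posterior Sampling for RL, minimal tabular sketch.

    Assumes a finite episodic MDP with S states, A actions, horizon H,
    and Bernoulli rewards. `env_reset()` returns an initial state;
    `env_step(s, a)` returns (next_state, reward).
    """
    rng = rng or np.random.default_rng()
    # Dirichlet(1, ..., 1) prior over each transition distribution.
    trans_counts = np.ones((S, A, S))
    # Beta(1, 1) prior over each Bernoulli reward mean.
    rew_alpha = np.ones((S, A))
    rew_beta = np.ones((S, A))

    for _ in range(episodes):
        # Sample one MDP from the posterior ...
        P = np.array([[rng.dirichlet(trans_counts[s, a]) for a in range(A)]
                      for s in range(S)])          # shape (S, A, S)
        R = rng.beta(rew_alpha, rew_beta)          # shape (S, A)
        # ... solve the sampled MDP by backward induction ...
        Q = np.zeros((H, S, A))
        V = np.zeros((H + 1, S))
        for h in range(H - 1, -1, -1):
            Q[h] = R + P @ V[h + 1]
            V[h] = Q[h].max(axis=1)
        # ... then act greedily for the sample and update the posterior.
        s = env_reset()
        for h in range(H):
            a = int(Q[h, s].argmax())
            s_next, r = env_step(s, a)
            trans_counts[s, a, s_next] += 1
            rew_alpha[s, a] += r
            rew_beta[s, a] += 1 - r
            s = s_next
```

The point of the sketch is the prior: replacing the per-state Dirichlet counts with a structured hypothesis class is what lets the regret scale with a dimension parameter of that class rather than with the raw number of states and actions.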