Planned summary for the Alignment Newsletter:

The original <@debate@>(@AI safety via debate@) paper showed that any problem in PSPACE can be solved by optimal play in a debate game judged by a (problem-specific) polynomial-time algorithm. Intuitively, this illustrates how the mechanism of debate can amplify a weak ability (solving arbitrary problems in P) into a much stronger one (solving arbitrary problems in PSPACE). One would hope that, similarly, debate would let us amplify a human's problem-solving ability into something much stronger.
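As a toy illustration (mine, not the post's), here is a Python sketch of the bisection flavor of this amplification: two debaters disagree about the result of a long computation, and a judge who can only afford to verify a single step still reliably sides with the honest debater. All names here (`step`, `run`, `liar_report`, `debate`) are invented for the example.

```python
# Toy illustration (not from the post): debate with a weak judge.
# The claim being debated is the result of running a cheap step function
# n times.  The judge cannot afford to run all n steps, but by having the
# two debaters bisect to the first step where their stories diverge, the
# judge only ever verifies a single step -- and the honest debater wins.

def step(x):
    """One cheap computation step; anything the judge can verify directly."""
    return (3 * x + 1) % 1_000_003

def run(x0, n):
    """The full n-step computation (too expensive for the judge; it stands
    in here for what the powerful debaters can compute)."""
    for _ in range(n):
        x0 = step(x0)
    return x0

def liar_report(x0, i, lie_from):
    """A dishonest debater's claim about step i: truthful before `lie_from`,
    off by one afterwards.  Any false final claim must diverge somewhere."""
    true_val = run(x0, i)
    return true_val if i < lie_from else true_val + 1

def debate(x0, n, lie_from):
    """Run the bisection debate and return the answer the judge accepts."""
    honest_answer = run(x0, n)                     # honest claim about run(x0, n)
    liar_answer = liar_report(x0, n, lie_from)     # false claim about run(x0, n)
    lo, lo_val = 0, x0                             # both debaters agree on the input
    hi, honest_hi = n, honest_answer
    while hi - lo > 1:
        mid = (lo + hi) // 2
        honest_mid = run(x0, mid)                  # honest debater's claim at mid
        liar_mid = liar_report(x0, mid, lie_from)  # liar's claim at mid
        if honest_mid == liar_mid:                 # agree: divergence is in (mid, hi]
            lo, lo_val = mid, honest_mid
        else:                                      # disagree: divergence is in (lo, mid]
            hi, honest_hi = mid, honest_mid
    # The judge's entire job: check one step from the last agreed-upon value.
    honest_wins = step(lo_val) == honest_hi
    return honest_answer if honest_wins else liar_answer

x0, n = 7, 1 << 16
assert debate(x0, n, lie_from=12_345) == run(x0, n)  # the honest answer prevails
```

The judge's work is O(log n) comparisons plus one call to `step`, while the expensive `run` calls stand in for the debaters' superior capabilities.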
This post applies the same technique to several other alignment proposals. For each proposal, we assume that the "human" can be an arbitrary polynomial-time algorithm and that the AI models play optimally with respect to their loss functions, and we ask which problems can be solved with these capabilities. The post finds that, as lower bounds, the various forms of amplification can access PSPACE, while <@market making@>(@AI safety via market making@) can access EXP. If there are untamperable pointers (so that the polynomial-time algorithm can work with objects of arbitrary size, as long as it only looks at a polynomial-sized subset of them), then amplification and market making can access R (the set of all decidable problems).
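Here is a minimal sketch of the pointer idea (again just an illustration, not the construction from the post, and it does not model the untamperability guarantee itself): the judge holds a pointer into an object far too large to read in full, and dereferences only polynomially many pieces of it. The names `child`, `leaf_label`, and `judge` are invented for the example.

```python
# Toy illustration of pointers into a huge object: a complete binary tree
# with 2**64 leaves, generated lazily.  The judge answers a question about
# one leaf by dereferencing only 64 pointers.

import hashlib

DEPTH = 64  # the tree has 2**DEPTH leaves and is never materialised

def child(node_id: bytes, direction: int) -> bytes:
    """Dereference a pointer: nodes come into existence only when pointed at."""
    return hashlib.sha256(node_id + bytes([direction])).digest()

def leaf_label(node_id: bytes) -> int:
    """The data stored at a leaf, derived on demand from its id."""
    return node_id[0] % 2

def judge(root: bytes, path_bits) -> int:
    """Polynomial-time judge: inspects a polynomial-sized subset (one
    root-to-leaf path) of an object far too large to read in full."""
    node = root
    for b in path_bits:
        node = child(node, b)
    return leaf_label(node)

root = hashlib.sha256(b"root").digest()
path = [1, 0] * (DEPTH // 2)   # selects one leaf out of 2**64
print(judge(root, path))       # the judge touched only DEPTH + 1 nodes
```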
Planned opinion:
In practice our models are not going to reach the optimal loss, and humans won't be able to solve arbitrary polynomial-time problems, so these theorems won't directly apply to reality. Nonetheless, this does seem like a worthwhile check to do: it feels similar to ensuring that a deep RL algorithm has a proof of convergence under idealized assumptions, even if those assumptions won't actually hold in reality. I have much more faith in a deep RL algorithm that started from a version with a convergence proof and was then modified based on empirical considerations.