I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that they don’t seem to work as a solution if you expect AGI via scaled-up deep learning. (Whereas I’d be pretty excited about pushing forward on them if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)
The main difference in the deep learning case is that it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program, as opposed to the search that found it. If you wanted assistance-style safety, then the learned program would need to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down that uncertainty by observing human behavior).
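As a toy illustration of what that assistance-like reasoning looks like, here is a minimal sketch (not from any existing library; the candidate reward functions, the Boltzmann-rationality assumption, and all the numbers are invented for illustration): the agent keeps a posterior over hypotheses about what the human wants and updates it from observed human choices.

```python
import numpy as np

# Hypothetical setup: three candidate reward functions over four actions.
# The agent is uncertain which one describes what the human wants.
candidate_rewards = np.array([
    [1.0, 0.0, 0.0, 0.5],   # hypothesis A: human mostly values action 0
    [0.0, 1.0, 0.5, 0.0],   # hypothesis B: human mostly values action 1
    [0.2, 0.2, 1.0, 0.2],   # hypothesis C: human mostly values action 2
])
posterior = np.full(len(candidate_rewards), 1 / len(candidate_rewards))  # uniform prior

def update_on_human_action(posterior, action, beta=5.0):
    """Bayesian update, assuming the human chooses actions Boltzmann-rationally
    (probability proportional to exp(beta * reward) under the true reward)."""
    likelihoods = np.exp(beta * candidate_rewards[:, action])
    likelihoods /= np.exp(beta * candidate_rewards).sum(axis=1)  # per-hypothesis normalizer
    new_posterior = posterior * likelihoods
    return new_posterior / new_posterior.sum()

# Watching the human repeatedly pick action 1 concentrates belief on hypothesis B,
# i.e. the agent narrows down its uncertainty about what the human wants.
for observed_action in [1, 1, 1]:
    posterior = update_on_human_action(posterior, observed_action)
print(posterior)
```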
But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
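To make the underdetermination point concrete, here is a deliberately trivial sketch (the two functions are invented for illustration): two programs that are indistinguishable on the training inputs but behave differently elsewhere, so a loss that only scores training behavior cannot tell which one the search found.

```python
training_inputs = [0, 1, 2, 3, 4]

def program_a(x):
    # Reasons in the "intended" way on all inputs.
    return 2 * x

def program_b(x):
    # Matches the intended behavior on the training set, then does something else.
    return 2 * x if x in training_inputs else -1

# Identical behavior (and hence identical loss) on the training inputs...
assert all(program_a(x) == program_b(x) for x in training_inputs)
# ...but different behavior off-distribution.
print(program_a(100), program_b(100))  # 200 vs -1
```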
Why isn’t it promising for scaled-up deep learning specifically, and what kind of approach might it be promising with?
Copying a private comment I wrote recently: