I think CIRL is pretty promising as an alignment approach for certain ways of building AGI (though notably, not promising for scaled-up deep learning).
I also think most of the reasons people give for being skeptical of CIRL (including everything currently on this post) are pretty bad.
I’m not going to defend this view here; when I’ve tried in the past it hasn’t made any difference.
I broadly agree that assistance games are a pretty great framework. The main reason I don’t work on them is that they don’t seem to work as a solution if you expect AGI via scaled-up deep learning. (Whereas I’d be pretty excited about pushing forward on them if it looked like we were getting AGI via things like explicit hierarchical planning or search algorithms.)
The main difference in the deep learning case is that it looks like you are doing a search over programs for a program that performs well on your loss function, and the intelligent thing is the learned program, as opposed to the search that found it. If you wanted assistance-style safety, then the learned program would need to reason in an assistance-like way (i.e. maintain uncertainty over what the humans want, and narrow down that uncertainty by observing human behavior).
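To make “reason in an assistance-like way” concrete, here is a minimal sketch; everything in it (the hypothesis set, the numbers, and the Boltzmann-rational human model) is my own illustrative assumption rather than anything from the discussion above. It shows an agent that keeps a posterior over a few candidate reward functions and narrows it down by Bayes-updating on observed human actions:

```python
import numpy as np

# Three hypotheses about what the human wants: a reward for each of 3 outcomes.
reward_hypotheses = np.array([
    [1.0, 0.0, 0.0],  # hypothesis A: human wants outcome 0
    [0.0, 1.0, 0.0],  # hypothesis B: human wants outcome 1
    [0.0, 0.0, 1.0],  # hypothesis C: human wants outcome 2
])
posterior = np.ones(3) / 3  # start maximally uncertain

def human_action_likelihood(action, rewards, beta=2.0):
    """P(human picks `action`) under an assumed Boltzmann-rational human model."""
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    return probs[action] / probs.sum()

def update_on_human_action(posterior, action):
    """Bayes update: narrow down the uncertainty by observing human behavior."""
    likelihoods = np.array([
        human_action_likelihood(action, r) for r in reward_hypotheses
    ])
    new_posterior = posterior * likelihoods
    return new_posterior / new_posterior.sum()

# Watching the human choose outcome 1 twice concentrates belief on hypothesis B.
for observed_action in [1, 1]:
    posterior = update_on_human_action(posterior, observed_action)
print(posterior)  # ~[0.02, 0.96, 0.02]
```

The point is just the shape of the reasoning: explicit uncertainty over what the human wants, reduced by observing the human.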
But then you run into a major problem, which is that we have no idea how to design the learned program, precisely because it is learned — all we do is constrain the behavior of the learned program on the particular inputs that we trained on, and there are many programs you could learn that have that behavior, some of which reason in a CIRL-like way and some of which don’t. (If you then try to solve this problem, you end up regenerating many of the directions that other alignment people work on.)
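The underdetermination point can be seen in miniature with a toy example of my own construction (nothing here comes from the thread): two “learned programs” that agree exactly on every training input, so the loss cannot distinguish them, yet diverge off-distribution:

```python
train_inputs = [0, 1, 2, 3]

def program_a(x):
    # "Intended" behavior: double the input everywhere.
    return 2 * x

def program_b(x):
    # Identical on the training set, arbitrary elsewhere.
    return 2 * x if x in train_inputs else -1

# The training data cannot tell these programs apart...
assert all(program_a(x) == program_b(x) for x in train_inputs)
# ...but they differ as soon as you leave the training distribution.
print(program_a(10), program_b(10))  # 20 -1
```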
Do you have links to comments/posts where you have done so in the past?
I’ve done it mostly via in-person conversations and private Slacks, but here’s one. I also endorse Paul’s comment.
Why isn’t it promising for scaled-up deep learning specifically, and what kind of approach might it be promising for?
Copying a private comment I wrote recently:
I’m not sure why Rohin thinks the arguments against CIRL are bad, but I wrote a post today on why I think the argument from fully updated deference / corrigibility is weak. I also found Paul Christiano’s response very helpful as an outline of objections to the utility uncertainty agenda.
Also relevant is this old comment from Rohin on difficulties with utility uncertainty.
I also just remembered this comment, which is more recent and has more details. Also I agree with Paul’s response.
If you have a defense of the idea, or a link to one I could read, I would be very interested to hear it. I wasn’t trying to be dogmatically skeptical.
Responded above