Thanks for amplifying the post that caused your large update. It’s pretty fascinating. I haven’t thought through it enough yet to know if I find it as compelling as you do.
Let me try to reproduce the argument of that post to see if I’ve got it:
If an agent already understands the world well (e.g., by extensive predictive training) before you start aligning it (e.g., with RLHF), then the alignment should be easier. The rewards you’re giving are probably attaching to the representations of the world that you want them to, because you and the model share a very similar model of the world.
In addition, you really need long term goals for deceptive alignment to happen. Those aren’t present in current models, and there’s no obvious way for models to develop them if their training signals are local in time.
I agree with both of these points, and I think they’re really important—if we make systems with both of those properties, I think they’ll be safe.
I’m not sure that AGI will be trained such that it knows a lot before alignment starts. Nor am I sure that it won’t have long term goals; in fact, I expect it will. That applies even more strongly to ASI.
But I think that tool AI might well continue to be trained that way. And that will give us a little longer to work on alignment. But we aren’t likely to stop with tool AI, even if it is enough to transform the world for the better.