Thank you! It’s particularly embarrassing to write a stereotypical newbie post since I’ve been thinking about this and reading LW and related sites since maybe 2004, and was a true believer in the difficulty of aligning maximizers until re-engaging recently. Your way of phrasing it clicks for me, and I think you’re absolutely correct about where most of the work is being done in this fable. This post didn’t get at the question I wanted to ask, because it implies that aligning an RL model will be easy if we try. And I don’t believe that. I agree with you that shard theory requires magic. There are some interesting arguments recently (here and here) that aligning an RL system might be easy if it has a good world model when we start aligning it, but I doubt that’s a workable approach, for practical reasons.
It was my intent to portray a situation where much less than half of the training went to alignment, and that little bit might still be stable and useful. But I’d need to paint a less rosy picture of the effort and outcome to properly convey that.