Welcome to LW! I feel like making this kind of post is almost a tradition. Don’t take the downvotes too hard.
Mild optimization is definitely something worth pursuing. But as Anon points out, in the story here, basically none of the work is being done by that; the AI already has ~half its optimization devoted to good stuff humans want, so it just needs one more bit to ~always do good stuff. But specifying good stuff humans want takes a whole lot of bits—whatever got those bits (minus one) into the AI is the real workhorse there.
Thank you! It’s particularly embarrassing to write a stereotypical newbie post, since I’ve been thinking about this and reading LW and related material since maybe 2004, and was a true believer in the difficulty of aligning maximizers until re-engaging recently. Your way of phrasing it clicks for me, and I think you’re absolutely correct about where most of the work is being done in this fable. This post didn’t get at the question I wanted, because it implies that aligning an RL model will be easy if we try, and I don’t believe that. I agree with you that shard theory requires magic. There are some interesting recent arguments (here and here) that aligning an RL system might be easy if it has a good world model when we start aligning it, but I doubt that approach is workable for practical reasons.
It was my intent to portray a situation where much less than half of the training went to alignment, yet that little bit might still be stable and useful. But I’d need to paint a less rosy picture of the effort and outcome to properly convey that.