The main thing at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.
No problem posting your questions here. I’m not sure of the best place but I don’t think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.
Pausing is hard. That’s why my scenarios barely involve pauses and only address them because others see them as a possibility.
Basically I think we have to get alignment mostly right before it’s time to pause, for the reasons he gives. I just think we might be able to do that. Language model agents are a really ideal alignment scenario, and instruction-following (IF) gives corrigibility for second chances when things start to go wrong. Asking an IF model about its alignment makes detecting misalignment easier, and re-aligning it is easy enough for the type of short pause that orgs and governments might actually do.
Moving slowly and carefully is hard too, and I don’t expect it. I expect the default alignment techniques to work if even a little effort is put in to making them work. Current models are aligned, and agents built from them will probably initially be aligned to. Letting them optimize into misalignment is pretty obvious and easy to stop.
To point 5 “the org has to do this plan”, I think having a good-enough plan is the default, not a tricky thing. Sure, they could fuck it up badly enough to drop the ball on what’s basically an easy project. In fact, I think they probably would if nobody thought misalignment was a risk. But if they take it halfway seriously, I expect the first alignment project to succeed. And I think they’ll take it halfway seriously once they’re actually talking to something smarter-than-human and fully agentic and capable of autonomy.
The difference here is in the specifics of alignment theory. Raemon’s scenarios do not include the advantages of corrigible/IF agents. Very few people are thinking both in terms of instruction-following for autonomous, self-directed AGI. Prosaic alignment thinkers are thinking of roughly but not precisely instruction-following agents, but not RSI-capable fully autonomous real AGI. Agent foundations people have largely accepted that “corrigibility is anti-natural” because Eliezer said it is—and he’s genuinely brilliant and a deep thinker. He just happens to be wrong on that point. He has chronic fatigue now, which can’t give him much time and energy to consider new ideas, and he’s got to be severely depressed at his perception that no one would listen to reason, so now we’re probably doomed. But I think he got this wrong initially because his perspective is in terms of fast takeoffs and maximizers/consequentialists. Slow takeoff with the non-consequentialist goal of following instructions is what we’re probably getting, and that changes the game entirely. Max Harms has defended [Corrigibility as Singular Target]( https://www.lesswrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1) for almost identical reasons to my optimism about IF, and he’s a former MIRI person who definitely understands EYs logic for thinking it wasn’t a possibility.
This conversation is useful because I’m working on a post on this, to be titled something like “a broader path: survival on the default fast timeline to AGI”.
I think there’s still plenty of work we can and should do to improve our odds, but I think we might survive even if that work all goes wrong. That doesn’t make me much of an optimist; I think we’ve got something right around 50⁄50 odds at this point, although of course that’s a very rough and uncertain estimates.
Edit: thanks for the compliment. I definitely try to keep my proposals down-to-earth. I don’t see any point in creating grand plans no one is going to follow.
The main thing at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.
Sorry it’s taken me a while to get back to this.
No problem posting your questions here. I’m not sure of the best place but I don’t think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.
I read that “Carefully Bootstrapped Alignment” is organizationally hard by Raemon and it did make some good points. Most of them I’d considered, but not all of them.
Pausing is hard. That’s why my scenarios barely involve pauses and only address them because others see them as a possibility.
Basically I think we have to get alignment mostly right before it’s time to pause, for the reasons he gives. I just think we might be able to do that. Language model agents are a really ideal alignment scenario, and instruction-following (IF) gives corrigibility for second chances when things start to go wrong. Asking an IF model about its alignment makes detecting misalignment easier, and re-aligning it is easy enough for the type of short pause that orgs and governments might actually do.
Moving slowly and carefully is hard too, and I don’t expect it. I expect the default alignment techniques to work if even a little effort is put in to making them work. Current models are aligned, and agents built from them will probably initially be aligned to. Letting them optimize into misalignment is pretty obvious and easy to stop.
To point 5 “the org has to do this plan”, I think having a good-enough plan is the default, not a tricky thing. Sure, they could fuck it up badly enough to drop the ball on what’s basically an easy project. In fact, I think they probably would if nobody thought misalignment was a risk. But if they take it halfway seriously, I expect the first alignment project to succeed. And I think they’ll take it halfway seriously once they’re actually talking to something smarter-than-human and fully agentic and capable of autonomy.
The difference here is in the specifics of alignment theory. Raemon’s scenarios do not include the advantages of corrigible/IF agents. Very few people are thinking both in terms of instruction-following for autonomous, self-directed AGI. Prosaic alignment thinkers are thinking of roughly but not precisely instruction-following agents, but not RSI-capable fully autonomous real AGI. Agent foundations people have largely accepted that “corrigibility is anti-natural” because Eliezer said it is—and he’s genuinely brilliant and a deep thinker. He just happens to be wrong on that point. He has chronic fatigue now, which can’t give him much time and energy to consider new ideas, and he’s got to be severely depressed at his perception that no one would listen to reason, so now we’re probably doomed. But I think he got this wrong initially because his perspective is in terms of fast takeoffs and maximizers/consequentialists. Slow takeoff with the non-consequentialist goal of following instructions is what we’re probably getting, and that changes the game entirely. Max Harms has defended [Corrigibility as Singular Target]( https://www.lesswrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1) for almost identical reasons to my optimism about IF, and he’s a former MIRI person who definitely understands EYs logic for thinking it wasn’t a possibility.
This conversation is useful because I’m working on a post on this, to be titled something like “a broader path: survival on the default fast timeline to AGI”.
I think there’s still plenty of work we can and should do to improve our odds, but I think we might survive even if that work all goes wrong. That doesn’t make me much of an optimist; I think we’ve got something right around 50⁄50 odds at this point, although of course that’s a very rough and uncertain estimates.
Edit: thanks for the compliment. I definitely try to keep my proposals down-to-earth. I don’t see any point in creating grand plans no one is going to follow.