Edit: I hope that I am not cluttering the comments by asking these questions. I am hoping to create a separate post where I list all the problems raised for the scalable alignment proposal and all the proposed solutions to them. So far, everything you said has seemed not only sensible but also plausible, so I greatly value your feedback.
I have found some other concerns about scalable oversight/iterative alignment, which come from this post by Raemon. They are mostly about the organizational side of scalable oversight:
Moving slowly and carefully is annoying. There’s a constant tradeoff between getting more done and taking on elevated risk. Employees who don’t believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for employees willing to take the risk seriously (or training them to) is difficult. There’s also the fact that many security procedures are just security theater. Engineers have sometimes been burned by overzealous testing practices. Figuring out a set of practices that are actually helpful, and that your engineers and researchers have good reason to believe in, is a nontrivial task.
Noticing when it’s time to pause is hard. The failure modes are subtle, and noticing things is just generally hard unless you’re actively paying attention, even if you’re informed about the risk. It’s especially hard to notice things that are inconvenient and require you to abandon major plans.
Getting an org to pause indefinitely is hard. Projects have inertia. My experience as a manager is that having people sitting around waiting for direction from me makes it hard to think. Either you have to tell people “stop doing anything,” which is awkwardly demotivating, or “Well, I dunno, you figure out something to do?” (in which case maybe they’ll keep doing capability-enhancing work without your supervision), or you have to actually give them something to do (which takes up cycles you’d prefer to spend thinking about the dangerous AI you’re developing). Even if you have a plan for what your capabilities or product workers should do when you pause, if they don’t know what that plan is, they might be worried about getting laid off. And then they may exert pressure that makes it feel harder to get ready to pause. (I’ve observed many management decisions where, even though we knew what the right thing to do was, conversations felt awkward and tense, and the manager-in-question developed an ugh field around it and put it off.)
People can just quit the company and work elsewhere if they don’t agree with the decision to pause. If some of your employees are capabilities researchers who are pushing the cutting-edge forward, you need them actually bought into the scope of the problem to avoid this failure mode. Otherwise, even though “you” are going slowly/carefully, your employees will go off and do something reckless elsewhere.
This all comes after an initial problem, which is that your org has to end up doing this plan, instead of some other plan. And you have to do the whole plan, not cutting corners. If your org has AI capabilities/scaling teams and product teams that aren’t bought into the vision of this plan, even if you successfully spin the “slow/careful AI plan” up within your org, the rest of your org might plow ahead.
The main thing, at least for me, is that you seem to be the biggest proponent of scalable alignment and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.
Sorry it’s taken me a while to get back to this.
No problem posting your questions here. I’m not sure of the best place but I don’t think clutter is an issue, since LW organizes everything rather well and has good UIs to navigate it.
I read Raemon’s post *“Carefully Bootstrapped Alignment” is organizationally hard*, and it did make some good points. Most of them I’d considered, but not all of them.
Pausing is hard. That’s why my scenarios barely involve pauses and only address them because others see them as a possibility.
Basically I think we have to get alignment mostly right before it’s time to pause, for the reasons he gives. I just think we might be able to do that. Language model agents are a really ideal alignment scenario, and instruction-following (IF) gives corrigibility for second chances when things start to go wrong. Asking an IF model about its alignment makes detecting misalignment easier, and re-aligning it is easy enough for the type of short pause that orgs and governments might actually do.
Moving slowly and carefully is hard too, and I don’t expect it. I expect the default alignment techniques to work if even a little effort is put into making them work. Current models are aligned, and agents built from them will probably be aligned initially too. Letting them optimize their way into misalignment is a failure mode that should be fairly obvious and easy to stop.
On point 5, “the org has to do this plan”: I think having a good-enough plan is the default, not a tricky thing. Sure, they could fuck it up badly enough to drop the ball on what’s basically an easy project. In fact, I think they probably would if nobody thought misalignment was a risk. But if they take it halfway seriously, I expect the first alignment project to succeed. And I think they’ll take it halfway seriously once they’re actually talking to something smarter-than-human, fully agentic, and capable of autonomy.
The difference here is in the specifics of alignment theory. Raemon’s scenarios do not include the advantages of corrigible/IF agents. Very few people are thinking in terms of both instruction-following and autonomous, self-directed AGI. Prosaic alignment thinkers are thinking of roughly (but not precisely) instruction-following agents, not RSI-capable, fully autonomous, real AGI. Agent foundations people have largely accepted that “corrigibility is anti-natural” because Eliezer said it is, and he’s genuinely brilliant and a deep thinker. He just happens to be wrong on that point. He has chronic fatigue now, which can’t leave him much time and energy to consider new ideas, and he’s got to be severely depressed by his perception that no one would listen to reason and that we’re now probably doomed. But I think he got this wrong initially because his perspective is in terms of fast takeoffs and maximizers/consequentialists. A slow takeoff with the non-consequentialist goal of following instructions is what we’re probably getting, and that changes the game entirely. Max Harms has defended [Corrigibility as Singular Target](https://www.lesswrong.com/posts/NQK8KHSrZRF5erTba/0-cast-corrigibility-as-singular-target-1) for almost identical reasons to my optimism about IF, and he’s a former MIRI person who definitely understands EY’s logic for thinking it wasn’t a possibility.
This conversation is useful because I’m working on a post on this, to be titled something like “a broader path: survival on the default fast timeline to AGI”.
I think there’s still plenty of work we can and should do to improve our odds, but I think we might survive even if that work all goes wrong. That doesn’t make me much of an optimist; I think we’ve got something right around 50/50 odds at this point, although of course that’s a very rough and uncertain estimate.
Edit: thanks for the compliment. I definitely try to keep my proposals down-to-earth. I don’t see any point in creating grand plans no one is going to follow.