I agree with many of the points in this post.

Here’s one that I do believe is mistaken in a hopeful direction:
6. We need to align the performance of some large task, a ‘pivotal act’ that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some “pivotal act”, strong enough to flip the gameboard, using an AGI powerful enough to do that. It’s not enough to be able to align a weak system—we need to align a system that can do some single very large thing. The example I usually give is “burn all GPUs”. [..]
It could actually be enough to align a weak system. This is the case where the system is “weak” in the sense that it can’t perform a pivotal act on its own, but it’s powerful enough that it can significantly accelerate development toward a stronger aligned AI/AGI with pivotal act potential.
This case is important because it helps to break down and simplify the problem. Thinking about how to build an extremely powerful aligned AI which can do a pivotal act is more daunting than thinking about how to build a weaker-but-still-impressive aligned AI that is useful for building more powerful aligned AIs.
I can think as well as anybody currently does about alignment, and I don’t see any particular bit of clever writing software that is likely to push me over the threshold of being able to save the world, or any nondangerous AGI capability to add to myself that does that trick. Seems to just fall again into the category of hypothetical vague weak pivotal acts that people can’t actually name and don’t actually exist.
Why not?

Oh. I see your point.
What specific capabilities will this weak AI have that lets you cross the distributional shift?
I think this sort of thing is not impossible, but I think it needs to have a framing like “I will write a software program that will make it slightly easier for me to think, and then I will solve the problem” and not like “I will write an AI that will do some complicated thought which can’t be so complicated that it’s dangerous, and there’s a solution in that space.” By the premise, the only safe thoughts are simple ones, and so if you have a specific strategy that could lead to alignment breakthroughs but just needs to run lots of simple for loops or whatever, the existence of that strategy is the exciting fact, not the meta-strategy of “humans with laptops can think better than humans with paper.”