I agree that the final tasks that humans do may look like “check that you understand and trust the work the AIs have done”, and that a lack of trust is a plausible bottleneck to full automation of AI research.
I don’t think the only way for humans at AI labs to get that trust is to automate alignment research, though that is one way. Human-conducted alignment research might lead them to trust AIs, or they might have a large amount of trust in the AIs’ work without believing the AIs are aligned. E.g. they separate the workflow into lots of narrow tasks that can be done by a variety of non-agentic AIs that they don’t think pose a risk; or they set up a system of checks and balances (where different AIs check each other’s work and look for signs of deception) that they trust despite thinking certain AIs may be unaligned; or they do such extensive adversarial training that they’re confident the AIs would never actually try anything deceptive in practice (perhaps because the AIs are paranoid that a seeming opportunity to trick humans is just a human-designed test of their alignment). TBC, I think “being confident that the AIs are aligned” is better and more likely than these alternative routes to trusting the work.
Also, when I’m forecasting AI capabilities I’m forecasting AI that could readily automate 100% of AI R&D, not AI that actually does automate it. If trust were the only factor preventing full automation, that would still count as AI that could readily automate 100%.
Also, they might let the AIs proceed with the research anyway even though they don’t trust that they are aligned, or they might erroneously trust that they are aligned due to deception. If this sounds irresponsible to you, well, welcome to Earth.