This seems to presume that you can divide up research topics into “alignment” vs “control” but this seems wrong to me.
Agreed. (And I made a comment to this effect on Mark’s draft of this short form.) I don’t think it is that productive to classify research directions. The central thing we’re trying to point at is a classification of reasons why you might think that you’re fine, where one reason is “the AI couldn’t do bad stuff without getting caught” and another reason is “the AI wouldn’t do bad stuff” (obviously simplifying somewhat).
I also think that white-box control should typically be considered pretty separately from black-box control, and the story for it is often pretty different. (Though when thinking about white-box control you should often adopt the central control perspective. E.g., you should incorporate white-box runtime detection techniques (like probes) into an overall control evaluation by exogenously trying to determine the probe’s reliability and other important properties of the probe. For instance, the probe can take advantage of black-box strategies, and how safe we are depends on how many actions AIs need to take to cause huge problems.)
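A toy sketch of that last point, under strong assumptions that are not from the comment above: suppose each bad action is flagged by the probe independently with some fixed rate, and the AI needs a fixed number of bad actions to cause a huge problem. The function name `catch_probability` and the 20% detection rate are hypothetical, chosen only for illustration; a real control evaluation would have to estimate probe reliability exogenously and would not get independence for free.

```python
def catch_probability(per_action_detection_rate: float, actions_needed: int) -> float:
    """Chance that at least one of the required bad actions is flagged.

    Assumes (hypothetically) that detection is independent across actions
    and that the per-action detection rate is the same every time.
    """
    if not 0.0 <= per_action_detection_rate <= 1.0:
        raise ValueError("per_action_detection_rate must be in [0, 1]")
    if actions_needed < 1:
        raise ValueError("actions_needed must be at least 1")
    # P(caught at least once) = 1 - P(every action slips through)
    return 1.0 - (1.0 - per_action_detection_rate) ** actions_needed


# Illustrative only: a weak probe (20% per-action detection) matters far more
# when the AI needs many actions than when a single action suffices.
for k in (1, 5, 20):
    print(k, round(catch_probability(0.2, k), 3))
# prints:
# 1 0.2
# 5 0.672
# 20 0.988
```

Under these assumptions, the same probe that is nearly useless against a one-shot failure becomes quite strong once many actions are required, which is why the number of actions needed matters for how much weight a probe can carry in a control argument.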
The claims I’d like to make are:
There should be more research specifically targeting a black-box control story and operating from a black-box control perspective. (I’d guess roughly 20% of the overall AI safety portfolio should be this, and maybe 30% of safety work at labs. This does not include work like “get AIs to be more useful at alignment research and get better at DC eval methodology”, which should also be subsidized from a control perspective but seems non-central.)
When people do research which could be pretty useful from a black-box control perspective (e.g. debate or other scalable-oversight-style research), they should consider sometimes explicitly thinking about the control perspective and adjusting their research accordingly (to the extent they can understand this perspective well enough). To the extent that people have an alignment-style story they think about while doing their research, they should be roughly equally excited about thinking about the control story. I also find that the control story is often more straightforward.
People thinking about safety plans for near-term powerful AI should strongly consider control.