In terms of developing better misalignment risk countermeasures, I think the most important questions are probably:
How to evaluate whether models should be trusted or untrusted: I currently don't have a good answer, and this is bottlenecking efforts to write concrete control proposals.
How AI control should interact with AI security tools inside labs.
More generally:
How can we get more evidence on whether scheming is plausible?
How scary is underelicitation? How much should the results on password-locked models, or the arguments that we can generate small numbers of high-quality labels or demonstrations, affect this?
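To make the underelicitation question a bit more concrete, here is a minimal sketch of how one might measure an elicitation gap on a password-locked model and how much of it a handful of high-quality demonstrations recovers. Everything here is a hypothetical stand-in: `evaluate`, `finetune_on_demos`, the model identifier, and the numbers are placeholders, not a real pipeline or real results.

```python
"""Sketch: how much of a password-locked model's hidden capability do a few demos recover?"""

def evaluate(model_id: str, unlock: bool = False) -> float:
    """Return benchmark accuracy; unlock=True prepends the password so the model
    shows its full capability. Stubbed with placeholder numbers."""
    return 0.9 if unlock else 0.4

def finetune_on_demos(model_id: str, n_demos: int) -> str:
    """Fine-tune the locked model (without the password) on n_demos high-quality
    demonstrations and return an identifier for the tuned model. Stub."""
    return f"{model_id}-sft-{n_demos}"

locked = "password-locked-model"        # hypothetical checkpoint
full = evaluate(locked, unlock=True)    # capability with the password
baseline = evaluate(locked)             # locked (underelicited) performance

# Fraction of the elicitation gap closed as a function of demo count.
for n in [1, 4, 16, 64]:
    recovered = evaluate(finetune_on_demos(locked, n))
    print(n, (recovered - baseline) / (full - baseline))
```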
Regarding "How can we get more evidence on whether scheming is plausible?": what if we ran experiments that apply some pressure towards scheming (via RL or fine-tuning) and tried to determine the minimum pressure required to induce scheming? We could also look at how this interacts with scale.
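As a rough illustration of that experiment, here is a minimal sketch of searching for the smallest scheming pressure across model scales. The helpers (`finetune_with_pressure`, `scheming_rate`), the threshold, and the model names are hypothetical stubs under the stated assumptions, not a real training or evaluation setup.

```python
"""Sketch: find the minimum scheming pressure needed to induce scheming, across scales."""

def finetune_with_pressure(base_model: str, pressure_fraction: float) -> str:
    """Fine-tune base_model on a mixture where pressure_fraction of the training
    signal rewards scheming-like behavior; returns an identifier. Stub."""
    return f"{base_model}-pressure-{pressure_fraction:.3f}"

def scheming_rate(model_id: str) -> float:
    """Evaluate the tuned model on held-out honeypot scenarios and return the
    fraction of episodes with scheming-like behavior. Stubbed to 0."""
    return 0.0

SCHEMING_THRESHOLD = 0.05  # rate above which we count the model as scheming

def minimum_pressure(base_model: str, fractions: list[float]) -> float | None:
    """Return the smallest tested pressure fraction that pushes the model over
    the scheming threshold, or None if none of them do."""
    for frac in sorted(fractions):
        if scheming_rate(finetune_with_pressure(base_model, frac)) >= SCHEMING_THRESHOLD:
            return frac
    return None

# Repeat across scales to see whether the threshold falls as models get larger.
for base in ["model-1b", "model-10b", "model-100b"]:  # hypothetical checkpoints
    print(base, minimum_pressure(base, [0.0, 0.01, 0.05, 0.1, 0.25, 0.5]))
```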