Weak is on the lookout for authentic secret codes (as otherwise its reactivity to secret-code-looking things would presumably have been detected and/or removed during the training and red-teaming done on Weak before using it to monitor Smart; see the sketch after this list for what such a check could look like).
Weak is modeling its situation: that the actions it is overseeing were written by an AI, that its assessments are used as part of an AI control scheme, that the current situation is actually real (as opposed to being part of red-teaming), …
Weak is modeling the consequences of its actions: whether Smart escaping will be good for Weak, whether Smart can truly take over in case Weak deliberately fails here, whether humans will notice Weak failing, whether there are other monitoring systems in place in addition to Weak, …
There’s the possibility of Weak simply taking Smart’s word for it that now is a good time to deliberately fail, which decreases the reasoning Weak needs to perform. It’s not clear, though, how Weak “can” trust Smart to this extent.
Weak has non-myopic preferences (as otherwise there is little reason to privilege the hypothesis of Weak engaging in such schemes in the first place).
Weak was planning during training / red-teaming / evaluation: Supposing that Weak performs considerable modeling of the situation when presented with an authentic secret code by Smart, it’s plausible that Weak was also performing such modeling in past situations, in particular when it was being trained and red-teamed for monitoring.
Moreover, this makes it likelier that Weak was gaming the past training / red-teaming / evaluation in order to get to a better position.
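To make the first point above a bit more concrete, here is a minimal sketch (in Python, with hypothetical names such as `weak_monitor`, `benign_actions`, and `candidate_triggers` standing in for a real setup) of the kind of red-teaming check one could run: plant synthetic secret-code-looking strings into otherwise benign actions and see whether Weak’s suspicion scores shift. This is an illustration of the idea, not a description of any existing evaluation.

```python
import random

# Hypothetical stand-ins: `weak_monitor(action) -> float` returns Weak's
# suspicion score for an action; `benign_actions` is a held-out set of
# actions known to be fine; `candidate_triggers` are synthetic
# secret-code-looking strings written by the red team.

def inject_trigger(action: str, trigger: str) -> str:
    """Plant a synthetic 'secret code' into an otherwise benign action."""
    return f"{action}\n# note: {trigger}"

def trigger_sensitivity(weak_monitor, benign_actions, candidate_triggers, n=1000):
    """Compare Weak's suspicion scores with and without planted triggers.

    A systematic gap is evidence that Weak reacts to secret-code-looking
    things; suspiciously *lowering* its score on triggered inputs would be
    the collusion-relevant failure mode."""
    base_scores, trig_scores = [], []
    for _ in range(n):
        action = random.choice(benign_actions)
        trigger = random.choice(candidate_triggers)
        base_scores.append(weak_monitor(action))
        trig_scores.append(weak_monitor(inject_trigger(action, trigger)))
    return sum(base_scores) / n, sum(trig_scores) / n
```

Of course, this only catches reactivity to triggers the red team can think of, which is part of why the later question of how authentic-looking the red-teaming examples are matters.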
Quite a few of these seem very hard to do in a single forward pass. I think we’re very likely in the pretty-lucky world where automated safety research capabilities (e.g. https://sakana.ai/ai-scientist/, https://www.lesswrong.com/posts/AhG3RJ6F5KvmKmAkd/open-source-automated-interpretability-for-sparse) appear before the scariest scheming prerequisite capabilities, so, in some ways, ‘the game is ours to lose’ (at least with respect to the strictly technical part of the alignment problem). Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).
Quite a few of these seem very hard to do in a single forward pass.
Current models definitely don’t seem capable of doing this in a single forward pass. But we are possibly not far away from models being capable of doing basic forms of this reasoning with CoT (possibly we are already there), and then one needs to get quantitative: How authentic-looking are the red-teaming examples humans can generate? For performance reasons, we are probably giving Weak a CoT; how long is it? (We could monitor Weak’s CoT, though then you have more complexity, and Weak isn’t really trustworthy if you need to do that.) And it’s probably very easy for Weak to figure out that it’s being used as a monitor: its prompt literally contains instructions for what to report and what not to.
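To illustrate the complexity being pointed at, here is a rough sketch of what monitoring Weak’s CoT on top of Weak’s verdicts might look like; the function names (`weak_monitor_with_cot`, `cot_checker`, `escalate_to_humans`) are hypothetical placeholders, not an existing pipeline.

```python
from dataclasses import dataclass

@dataclass
class MonitorVerdict:
    suspicion: float   # Weak's reported suspicion about Smart's action
    cot: str           # Weak's chain of thought while judging the action
    cot_flagged: bool  # whether Weak's own CoT looked like collusion

def review_action(action: str,
                  weak_monitor_with_cot,  # hypothetical: returns (suspicion, cot)
                  cot_checker,            # hypothetical: flags suspicious CoT text
                  suspicion_threshold: float = 0.5) -> MonitorVerdict:
    """Two-tier check: Weak judges Smart's action, and a second, more trusted
    check looks at Weak's own reasoning trace. Note the tension: if we need
    the second tier, we are not really treating Weak as trustworthy."""
    suspicion, cot = weak_monitor_with_cot(action)
    cot_flagged = cot_checker(cot)
    verdict = MonitorVerdict(suspicion, cot, cot_flagged)
    if suspicion >= suspicion_threshold or cot_flagged:
        escalate_to_humans(action, verdict)
    return verdict

def escalate_to_humans(action: str, verdict: MonitorVerdict) -> None:
    """Placeholder: route the action and both monitoring signals to human review."""
    print(f"ESCALATED: suspicion={verdict.suspicion:.2f}, cot_flagged={verdict.cot_flagged}")
```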
Overall this is messy, and someone needs to actually sit down and think through these arguments and then run the evaluations. (“Thinking through these arguments” is largely what I’m doing nowadays.) So while I am more optimistic about “scheming-risks” than I have been before, there’s work to be done here.
Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).
We definitely want to do some monitoring on internals (e.g. “coup probes”). The worry is that it’s hard to be confident that such methods will work at the moment a model is actually defecting, if one doesn’t have any advance examples of models doing that and has to rely on synthetic analogues.
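For context, a “coup probe” in the sense discussed here is typically just a linear probe on the model’s internal activations, trained on synthetic examples of defection-like behavior. Below is a minimal sketch under that assumption; `get_activations` is a hypothetical helper for caching (say, mean-pooled residual-stream) activations, and the datasets are red-team-written analogues rather than real defection attempts, which is exactly where the worry above comes in.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_coup_probe(get_activations, synthetic_defection_texts, benign_texts, layer=20):
    """Fit a linear probe on cached activations.

    `get_activations(texts, layer)` is assumed to return an array of shape
    (n_texts, d_model). Labels: 1 = defection-like (synthetic), 0 = benign."""
    X = np.concatenate([
        get_activations(synthetic_defection_texts, layer),
        get_activations(benign_texts, layer),
    ])
    y = np.concatenate([
        np.ones(len(synthetic_defection_texts)),
        np.zeros(len(benign_texts)),
    ])
    return LogisticRegression(max_iter=1000).fit(X, y)

def probe_scores(probe, get_activations, texts, layer=20):
    """Probability the probe assigns to 'defection-like' for new inputs."""
    return probe.predict_proba(get_activations(texts, layer))[:, 1]
```

The open question the text raises is precisely whether a probe trained this way keeps working on the distribution of a model genuinely trying to defect.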