Olli Järviniemi comments on Trustworthy and untrustworthy models

Olli Järviniemi 22 Aug 2024 1:08 UTC
3 points
Quite a few of these seem very hard to do in a single forward pass.
Current models definitely don’t seem capable of doing this in a single forward pass. But we are possibly not far from models that can do basic forms of this reasoning with CoT (possibly we are already there), and then one needs to get quantitative: How authentic-looking can human-generated red-teaming examples be? For performance reasons, we are probably giving Weak a CoT, so how long is it? (We could monitor Weak’s CoT, though that adds complexity, and Weak isn’t really trustworthy if you need to do that.) And it’s probably very easy for Weak to figure out that it’s being used as a monitor: its prompt literally contains instructions for what to report and what not to.
Overall this is messy, and someone needs to actually sit down and think through these arguments and then run the evaluations. (“Thinking through these arguments” is largely what I’m doing nowadays.) So while I am more optimistic about “scheming-risks” than I have been before, there’s work to be done here.
Rephrased, we can use Weak to automate non-trivial chunks of safety research and as an internals monitor, too (not just on externalized reasoning).
We definitely want to do some monitoring on internals (e.g. “coup probes”). The worry is that it’s hard to get confidence that such methods work when a model is actually defecting if one has no examples in advance of the model doing that and has to rely on synthetic analogues.
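To make the worry about synthetic analogues concrete, here is a toy sketch. It is not the "coup probes" method itself: it is a generic logistic-regression probe trained on made-up Gaussian "activations". The function names, the dimensionality, and in particular the assumption that real defection only partially shares a direction with the synthetic examples are all hypothetical choices for illustration.

```python
import numpy as np

def train_linear_probe(acts, labels, lr=0.1, steps=1000):
    """Fit a logistic-regression probe on activation vectors by gradient descent."""
    n, d = acts.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        grad = probs - labels  # gradient of the log loss w.r.t. the logits
        w -= lr * (acts.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, acts, labels):
    """Fraction of examples the probe classifies correctly."""
    preds = (acts @ w + b) > 0
    return float((preds == labels.astype(bool)).mean())

def make_activations(rng, n, direction):
    """Toy stand-in for internals: Gaussian noise plus a shift along
    `direction` for the positive ("defecting") class. Purely hypothetical."""
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, direction.shape[0])) + np.outer(labels, direction)
    return acts, labels.astype(float)

rng = np.random.default_rng(0)
dim = 32

# Assumption: synthetic defection examples differ from benign ones along axis 0.
synthetic_dir = np.zeros(dim)
synthetic_dir[0] = 3.0

# Assumption: real defection overlaps only weakly with that axis and mostly
# lives along a different one that the synthetic data never exercised.
real_dir = np.zeros(dim)
real_dir[0] = 1.0
real_dir[1] = 3.0

train_acts, train_labels = make_activations(rng, 2000, synthetic_dir)
w, b = train_linear_probe(train_acts, train_labels)

synthetic_acc = probe_accuracy(w, b, *make_activations(rng, 2000, synthetic_dir))
real_acc = probe_accuracy(w, b, *make_activations(rng, 2000, real_dir))

print(f"accuracy on held-out synthetic examples: {synthetic_acc:.2f}")
print(f"accuracy on shifted 'real' examples:     {real_acc:.2f}")
```

In this toy setup the probe looks fine on held-out synthetic data and degrades on the shifted distribution, and the point is that the held-out synthetic number alone gives no warning of that. Whether real internals behave anything like this is exactly the open empirical question.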