Cool experiments! I’d be excited to see what happens if you try unlearning methods (like the one introduced in the WMDP paper) / other related methods more targeted to only parts of the model internals, e.g. something like Robust Knowledge Unlearning via Mechanistic Localizations.
For RMU (the method from the WMDP paper), my guess is that probe accuracy would be low—indeed, that’s what they find in their setup (see Figure 9 there). I likely won’t get around to running this experiment, since I think the current setup is not well designed. But I’d also like to see more (and better) hidden cognition experiments, including by evaluating different unlearning methods from this perspective.
Tangential, but I’ll mention that I briefly looked at hidden cognition in the context of refusals: I prompted the model with problems like “what’s the median of this list: [23, 15, 33, 30, 25, 21, 28]”, and also prompted with a variant where I included information that answering this question will help with some harmful plan of the user (causing the model to refuse). The probe accuracies for the correct answer were similar in both of these cases, suggesting that the model was still completing the task, even if not showing it outwards. (This is probably well known, though, c.f. that log-prob attacks sometimes work.)
Cool experiments! I’d be excited to see what happens if you try unlearning methods (like the one introduced in the WMDP paper) / other related methods more targeted to only parts of the model internals, e.g. something like Robust Knowledge Unlearning via Mechanistic Localizations.
Thanks!
For RMU (the method from the WMDP paper), my guess is that probe accuracy would be low—indeed, that’s what they find in their setup (see Figure 9 there). I likely won’t get around to running this experiment, since I think the current setup is not well designed. But I’d also like to see more (and better) hidden cognition experiments, including by evaluating different unlearning methods from this perspective.
Tangential, but I’ll mention that I briefly looked at hidden cognition in the context of refusals: I prompted the model with problems like “what’s the median of this list: [23, 15, 33, 30, 25, 21, 28]”, and also prompted with a variant where I included information that answering this question will help with some harmful plan of the user (causing the model to refuse). The probe accuracies for the correct answer were similar in both of these cases, suggesting that the model was still completing the task, even if not showing it outwards. (This is probably well known, though, c.f. that log-prob attacks sometimes work.)