Adversarial training makes it harder for the malignant 1% to do anything dangerous, since it makes it harder to tell whether it’s in the test distribution. But it does little to prevent a malignant 1% from very conservatively trying (and usually failing) to ascertain whether it’s in the test distribution.
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won’t do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).
This likely involves introducing some asymmetry between the adversary’s task and the test time task. Examples of possible asymmetries including allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).
(I don’t know what “prevent” means here.)
Interpretability makes it harder for the malignant 1% to be hidden, but doesn’t prevent malignant cognition it can’t detect.
I agree that if your interpretability doesn’t allow you to detect the agent thinking “Am I on the training distribution?” then it won’t work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming is the prior for bad behavior.)
It does prima facie seem that an agent spending 100% of its cognition being competent and corrigible achieves higher reward than an agent that only spends 99% of its cognition being competent and corrigible, and 1% of its cognition trying (and almost always failing) to see if it’s in the test distribution.
I don’t think this fact makes bad behavior unlikely on their own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training distribution. We can potentially exploit the fact that gradient descent updates the model continuously.
I’m still confused about how each of the approaches would prevent us from eventually creating agents that spend 99% of their cognition acting corrigibly, while spending a well-hidden 1% of its cognition trying to sniff out whether it’s in the test distribution, and executing a treacherous turn if so.
(“Well-hidden” here means “well-hidden by gradient descent,” not “well-hidden by the agent,” right?)
The question is roughly: can you be good enough at finding problematic inputs that, if a problematic input exists, it is much more likely to be found during the adversarial training step than at deployment time? If so, then a model with high reward probably won’t do anything bad at deployment time (exactly how good you need to be depends on exactly how overwhelmingly strong the prior for malign behavior is).
This likely involves introducing some asymmetry between the adversary’s task and the test time task. Examples of possible asymmetries including allowing the adversary to produce different kinds of inputs (e.g. setting activations arbitrarily), letting the adversary poke around inside the model to find bad cases (e.g. doing gradient descent, combining with interpretability, or backwards chaining through the model), or giving the adversary a much weaker job (e.g. only asking them to produce an input on which the model thinks anything remotely suspicious as judged by interpretability, or finding any conditions under which the model fails to instantly shut down a malicious scheme that it previously initiated).
(I don’t know what “prevent” means here.)
I agree that if your interpretability doesn’t allow you to detect the agent thinking “Am I on the training distribution?” then it won’t work. So then the question is: can interpretability be strong enough that if the model is thinking something catastrophically dangerous, you have a very good chance of noticing it? (Again, how good you need to be depends on how overwhelming is the prior for bad behavior.)
I don’t think this fact makes bad behavior unlikely on their own. But interpretability or adversarial training could exploit the fact that your model is trained by gradient descent. Regularization can strengthen the basic observation that a malign failure needs to use parts of the model that are exercised during training, making it easier to identify possible malign failures by examining behavior on the training distribution. We can potentially exploit the fact that gradient descent updates the model continuously.
(“Well-hidden” here means “well-hidden by gradient descent,” not “well-hidden by the agent,” right?)