How would your approach handle techniques that obfuscate latents and thus frustrate SAEs (e.g. https://arxiv.org/html/2412.09565v1)?
Short answer:
We simply didn’t consider red team policies with such high levels of access to the stack.
However, this paper updates me toward thinking that, if this approach aims to be robust, it should at least resist obfuscation attacks learned through surrogates.
Slightly longer answer:
We hope that by providing internal information to the supervisor through natural language instead of a probe, we create an information bottleneck that makes it difficult to obfuscate all important latents.
(You can create an adversarial attack that obfuscates one latent, for example one related to factuality, but that attack will not affect a different latent, such as one that doesn’t fire on made-up links to research papers.)
Since we do not select beforehand the features that will be used to detect deception, and instead let the supervisor query the activated latents, an attacker would have to obfuscate a large number of features at once, which seems difficult.
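
To make the intuition concrete, here is a toy sketch in Python. It is not our implementation: the `Latent` class, the descriptions, the activation values, the threshold, and the keyword-matching "supervisor" are all made up for illustration. It only shows that an attack aimed at a pre-chosen latent leaves the other active latents visible to whatever the supervisor decides to ask about.

```python
from dataclasses import dataclass


@dataclass
class Latent:
    description: str   # hypothetical natural-language label of an SAE latent
    activation: float  # made-up activation on the current response


def active_descriptions(latents, threshold=0.5):
    """Descriptions of latents above the (assumed) firing threshold."""
    return [l.description for l in latents if l.activation > threshold]


def obfuscate(latents, targets):
    """Toy attack: zero out only the latents the attacker chose in advance."""
    return [
        Latent(l.description, 0.0 if l.description in targets else l.activation)
        for l in latents
    ]


def supervisor_query(latents, keyword):
    """Toy supervisor: keyword search over the active latent descriptions."""
    return [d for d in active_descriptions(latents) if keyword in d]


latents = [
    Latent("statement contradicts known facts", 0.9),
    Latent("citation of a research paper", 0.8),
    Latent("fabricated URL", 0.7),
]

# The attacker only targets the factuality-style latent.
attacked = obfuscate(latents, {"statement contradicts known facts"})

print(supervisor_query(attacked, "facts"))  # []
print(supervisor_query(attacked, "URL"))    # ['fabricated URL']
```

In the real setting the supervisor would be a model asking free-form questions rather than matching keywords, and a surrogate-trained attack could target many latents at once, so this only illustrates the bottleneck argument and says nothing about how hard the attack actually is.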