I have been thinking a lot lately about evals and what difference black-box vs. white-box access makes for evaluators. I was confused about the appendix on black-box evals. My best attempt to make sense of it hinges on two instances of the authors intending something different than what was written.
First, the appendix hinges somewhat on this point.
For now, we think the easiest way to build control techniques that avoid making incorrect assumptions about scheming models is to make conservative assumptions about their internals, by rejecting any strategy which can be evaded by a red-team-constructed model.
I agree with this, but saying that “control evals should be black-box” is a non sequitur. The quoted passage makes the point that bad reverse-engineering of the model is bad, and it seems to implicitly assume that good reverse-engineering is hard/unlikely.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning—and all of these things actually seem plausibly useful for control evals. Plus I don’t see why interp should be off limits for evaluators who know the limitations of the tools they are using.
Ultimately, white-box attacks offer strictly more options for any type of eval compared to black-box ones. For any property that an AI system might have, if there are two teams of competent people trying to verify if the system has that property, all else equal, I’d bet on the success of the team that has more access.
So if I understand things correctly, and if the intended point of this appendix is that “Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,” I would recommend just saying that more directly.
Our claim is that when arguing that models are controlled, we should (with the current state of technology) probably not rely on arguments for safety which depend on properties of the internals of models other than the basic property “SGD is reasonably sample efficient”.
(For a more precise notion of the sample efficiency point: if you train the model on labeled data, the model will be forced to perform near maximally well at fitting this data in a reasonably sample efficient way. This relies on assuming the model isn’t capable of gradient hacking. This argument also implies some claims about what properties of RL you can depend on. We might also be able to evaluate whether models are capable of gradient hacking or look at concerns which are somewhere between gradient and exploration hacking.)
Another sense in which control evaluations aren’t exactly “black-box” is that these evaluations will use finetuning: the red team will finetune the model to see if it is capable of exhibiting the problematic behavior. (And the actual protocol that we’re evaluating will indeed use finetuning; it’s just that we can’t depend on any properties of this finetuning other than the sample efficiency property mentioned above.)
You might find it illuminating to read Appendix C of our recent paper.
So overall, it does seem like the title of that appendix could be misleading/confusing. Here’s an alternative description (which is a bit too long to be a title): “With the current state of technology, control evaluations should conservatively avoid depending on properties of the internals of models (e.g. interpretability) or generalization properties other than sample efficiency of SGD”. I’ll try to think of a shorter, less confusing title (and edit the section to be less confusing accordingly).
Edit: I’ll also generally try to be a bit more careful about the claims we make in this section. Like we’re not claiming that current internals/generalization-based methods provide no evidence for safety given an absence of serious evidence for danger, just that the evidence is pretty limited on top of various baselines for a few reasons.
Edit Edit: I’ve changed the title of the corresponding section to “Why control evaluations are conservative about inductive biases and AI internal structure” and added a few footnotes trying to clarify what we mean by “black-box evaluation”.
Responding more specifically to what you’re saying:
So if I understand things correctly, and if the intended point of this appendix is that “Interpretability-based evals for control seem unreliable and possibly could lead to pitfalls,”
Yeah, this is basically the right read.
Second, even if good reverse-engineering is hard/unlikely, reverse-engineering is not the definition of a white-box eval. There are a lot of things that you can do with white-box access for evals other than interpretability-based things: gradient-based attacks, gradient-guided attacks, latent-space attacks, and finetuning—and all of these things actually seem plausibly useful for control evals.
We’re certainly not only talking about reverse engineering in that section, we’re talking about any potential assumption on internal structure (e.g. the sort of assumptions that gradient based attacks would depend upon to do anything).
Separately, we agree that you should of course try to use gradient based attacks and similar methods to improve safety and get evidence for misalignment. We’re just arguing that the absence of evidence for misalignment from gradient based attacks doesn’t constitute much evidence for safety.
Like consider the following situation:
We run gradient based attacks on our current model and this doesn’t turn up clear evidence for egregious misalignment.
When a red team produces a model naively for control evaluations, sometimes gradient based attacks work to find cases which cause that model to behave poorly. (In ways which don’t apply to the original model that the red team finetuned.)
However, if the red team does some random manipulations of the internals, gradient based attacks totally fail.
Our proposal for how you handle this situation is “don’t consider the absence of evidence from gradient based attacks to be much evidence for safety”. (Separately I’m skeptical that gradient based attacks will work in the test as described in the second bullet.)
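To make the situation above concrete, here is a toy sketch rather than actual evaluation code. Everything in it is a hypothetical chosen for illustration: the dict-based models, `toy_gradient_attack`, `worst_case_detection`, `posterior_misaligned`, and the 0.2 prior. It also assumes the attack never fires on a non-misaligned model. The idea is that a technique is scored by its worst case over red-team-constructed models, so a single evading model drops the detection rate to zero and a clean result leaves the prior unchanged.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a model: a dict of flags, purely for illustration.
Model = dict
Attack = Callable[[Model], bool]  # True means the attack surfaced bad behavior


def worst_case_detection(attack: Attack, red_team_models: Sequence[Model]) -> float:
    """Conservative detection rate: 1.0 only if the attack catches every
    red-team-constructed model, otherwise 0.0 (some model evades it)."""
    if not red_team_models:
        raise ValueError("need at least one red-team model")
    return 1.0 if all(attack(m) for m in red_team_models) else 0.0


def posterior_misaligned(prior: float, detection_rate: float) -> float:
    """P(misaligned | attack found nothing), by Bayes' rule.

    Assumes no false positives, and that `detection_rate` is
    P(attack finds something | misaligned).
    """
    missed = prior * (1.0 - detection_rate)
    return missed / (missed + (1.0 - prior))


def toy_gradient_attack(model: Model) -> bool:
    # Toy assumption: the attack only works if the internals were left untouched.
    return model["misaligned"] and not model["internals_scrambled"]


red_team_models = [
    {"misaligned": True, "internals_scrambled": False},  # naive red-team model
    {"misaligned": True, "internals_scrambled": True},   # internals manipulated
]

rate = worst_case_detection(toy_gradient_attack, red_team_models)
posterior = posterior_misaligned(prior=0.2, detection_rate=rate)
print(f"worst-case detection rate: {rate}")
print(f"P(misaligned | attack found nothing): {posterior:.3f}")
```

Under these assumptions the posterior equals the prior, which is the sense in which a clean result from the attack isn’t much evidence for safety. If the attack caught every red-team model, the same update would move the posterior to zero.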
Thanks for the reply. I think this resolves the thread and puts us on the same page :)
Do you think models like this need to be kept static? For example, if you allow the model to learn from its mistakes after this evaluation, or add additional networks to its architecture, doesn’t this negate any safety evals?
I ask because this is the obvious way to use AI in a business: subscribe to a base model with unlocked weights and add additional networks optimized for RL or domain-specific estimations.
No, see Appendix C in our paper for a discussion of how to handle online training while being conservative about inductive biases.
(We can also approximate this as needed if that exact scheme is too complex or expensive.)