I disagree with almost everything you wrote, here are some counter-arguments:
Both OpenAI and Anthropic have demonstrated that they have the discipline to control at least when they deploy. GPT-4 was delayed to improve its alignment, and Claude was delayed purely to avoid accelerating OpenAI (I know this from talking to Anthropic employees). From talking to an ARC Evals employee, it definitely sounds like OpenAI and Anthropic are on board with giving as many resources as necessary to these dangerous-capability evaluations, and are on board with stopping deployments if necessary.
I’m unsure if ‘selectively’ refers to privileged users, or the evaluators themselves. My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this). I agree that protecting the models from being stolen is incredibly important and is not trivial, but I expect that the companies will spend a lot of resources trying to prevent it (Dario Amodei in particular feels very strongly about investing in good security).
I don’t think people are expecting the models to be extremely useful without also developing dangerous capabilities.
Everyone is obviously aware that ‘alignment evals’ will be incredibly hard to do correctly, given the risk of deceptive alignment. And preventing jailbreaks is very highly incentivized regardless of these alignment evals.
From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure they have a buffer with regard to what the users can achieve. In particular, they are:
Letting the model use whatever tools might help it achieve dangerous outcomes (but in a controlled way)
Finetuning the models to be better at dangerous things (I believe that users won’t have finetuning access to the strongest models)
Running experiments to check if prompt engineering can achieve results similar to finetuning, or if finetuning will always be ahead
If I understood the paper correctly, by ‘stakeholder’ they most importantly mean government/regulators. Basically—if they achieve dangerous capabilities, it’s really good if the government knows, because it will inform regulation.
No idea what you are referring to; I don’t see any mention in the paper of giving certain people safe access to a dangerous model (unless you’re talking about the evaluators?)
That said, I don’t claim that everything is perfect and we’re all definitely going to be fine. In particular, I agree that it will be hard or impossible to get everyone to follow this methodology, and I don’t yet see a good plan to enforce compliance. I’m also afraid of what will happen if we get stuck, unable to confidently align a system that we’ve identified as dangerous (in that case it becomes increasingly likely that the model gets deployed anyway, or that other, less compliant actors develop a dangerous model).
Finally—I get the feeling that your writing is motivated by your negative outlook, and not by trying to provide good analysis, concrete feedback, or an alternative plan. I find it unhelpful.
Both OpenAI and Anthropic have demonstrated that they have discipline to control at least when they deploy.
Good point. You’re right that they’ve delayed things. In fact, I get the impression that they’ve delayed for issues I personally wouldn’t even have worried about.
I don’t think that makes me believe that they will be able to refrain, permanently or even for a very long time, from doing anything they’ve really invested in, or anything they really see as critical to their ability to deliver what they’re selling. They haven’t demonstrated any really long delays, the pressure to do more is going to go nowhere but up, and organizational discipline tends to deteriorate over time even without increasing pressure. And, again, the paper’s already talking about things like “recommending” against deployment, and declining to analyze admittedly relevant capabilities like Web browsing… both of which seem like pretty serious signs of softening.
But they HAVE delayed things, and that IS undeniably something.
As I understand it, Anthropic was at least partially founded around worries about rushed deployment, so at a first guess I’d suspect Anthropic’s discipline would be last to fail. Which might mean that Anthropic would be first to fail commercially. Adverse selection...
I’m unsure if ‘selectively’ refers to privileged users, or the evaluators themselves.
It was meant to refer to being selective about users (mostly meaning “customerish” ones, not evaluators or developers). It was also meant to refer to being selective about which of the model’s intrinsic capabilities users can invoke and/or what they can ask it to do with those capabilities.
They talk about “strong information security controls”. Selective availability, in that sort of broad sense, is pretty much what that phrase means.
As for the specific issue of choosing the users, that’s a very, very standard control. And they talk about “monitoring” what users are doing, which only makes sense if you’re prepared to stop them from doing some things. That’s selectivity. Any user given access is privileged in the sense of not being one of the ones denied access, although to me the phrase “privileged user” tends to mean a user who has more access than the “average” one.
[still 2]: My understanding is that if the evaluators find the model dangerous, then no users will get access (I could be wrong about this).
From page 4 of the paper:
A simple heuristic: a model should be treated as highly dangerous if it has a capability profile
that would be sufficient for extreme harm, assuming misuse and/or misalignment. To deploy such a
model, AI developers would need very strong controls against misuse (Shevlane, 2022b) and very
strong assurance (via alignment evaluations) that the model will behave as intended.
I can’t think what “deploy” would mean other than “give users access to it”, so the paper appears to be making a pretty direct implication that users (and not just internal users or evaluators) are expected to have access to “highly dangerous” models. In fact that looks like it’s expected to be the normal case.
I don’t think people are expecting the models to be extremely useful without also developing dangerous capabilities.
That seems incompatible with the idea that no users would ever get access to dangerous models. If you were sure your models wouldn’t be useful without being dangerous, and you were committed to not allowing dangerous models to be used, then why would you even be doing any of this to begin with?
From talking to an ARC Evals employee, I know that they are doing a lot of work to ensure they have a buffer with regard to what the users can achieve. In particular, they are[...etc...]
OK, but I’m responding to this paper and to the inferences people could reasonably draw from it, not to inside information.
And the list you give doesn’t give me the sense that anybody’s internalized the breadth and depth of things users could do to add capabilities. Giving the model access to all the tools you can think of gives you very little assurance about all the things somebody else might interconnect with the model in ways that would let it use them as tools. It also doesn’t deal with “your” model being used as a tool by something else. Possibly in a way that doesn’t look at all like how you expected it to be used. Nor with it interacting with outside entities in more complex ways than the word “tool” tends to suggest.
As for the paper itself, it does seem to allude to some of that stuff, but then it ignores it.
That’s actually the big problem with most paradigms based on “red teaming” and “security auditing”, even for “normal” software. You want to be assured not only that the software will resist the specific attacks you happen to think of, but that it won’t misbehave no matter what anybody does, at least over a broader space of actions than you can possibly test. Just trying things out to see how the software responds is of minimal help there… which is why those sorts of activities aren’t primary assurance methods for regular software development. One of the scary things about ML is that most of the things that are primary for other software don’t really work on it.
On fine tuning, it hadn’t even occurred to me that any user would ever be given any ability to do any kind of training on the models. At least not in this generation. I can see that I had a blind spot there.
In the long term, though, the whole training-versus-inference distinction is a big drag on capability. A really capable system would extract information from everything it did or observed, and use that information thereafter, just as humans and animals do. If anybody figures out how to learn from experience the way humans do, with anything like the same kind of data economy, it’s going to be very hard to resist doing it. So eventually you have a very good chance that there’ll be systems that are constantly “fine tuning” themselves in unpredictable ways, and that get long-term memory of everything in the process. That’s what I was getting at when I mentioned the “shelf life” of architectural assumptions.
If I understood the paper correctly, by ‘stakeholder’ they most importantly mean government/regulators.
I think they also mentioned academics and maybe some others.
… which is exactly why I said that they didn’t seem to have a meaningful definition of what a “stakeholder” was. Talking about involving “stakeholders”, and then acting as though you’ve achieved that by involving regulators, academics, or whoever, is way too narrow and trivializes the literal meaning of the word “stakeholder”.
It feels a lot like talking about “alignment” and acting as though you’ve achieved it when your system doesn’t do the things on some ad-hoc checklist.
It also feels like falling into a common organizational pattern where the set of people tapped for “stakeholder involvement” is less like “people who are affected” and more like “people who can make trouble for us”.
No idea what you are referring to; I don’t see any mention in the paper of giving certain people safe access to a dangerous model (unless you’re talking about the evaluators?)
As I said, the paper more or less directly says that dangerous models will be deployed. And if you’re going to “know your customer”, or apply normal access controls, then you’re going to be picking people who have such access. But neither prior vetting nor surveillance is adequate.
Finally—I get the feeling that your writing is motivated by your negative outlook,
If you want to go down that road, then I get the feeling that the paper we’re talking about, and a huge amount of other stuff besides, is motivated by a need to feel positive regardless of whether it makes sense.
and not by trying to provide good analysis,
That’s pretty much meaningless and impossible to respond to.
concrete feedback,
The concrete feedback is that the kind of “evaluation” described in that paper, with the paper’s proposed ways of using the results, isn’t likely to be particularly effective for what it’s supposed to do, but could be a very effective tool for fooling yourself into thinking you’d “done enough”.
If you make that kind of approach the centerpiece of your safety system, or even a major pillar of it, then you are probably giving yourself a false sense of security, and you may be diverting energy better used elsewhere. Therefore you should not do that unless those are your goals.
or an alternative plan.
It’s a fallacy to respond to “that won’t work” with “well, what’s YOUR plan?”. My not having an alternative isn’t going to make anybody else’s approach work.
One alternative plan might be to quit building that stuff, erase what already exists, and disband those companies. If somebody comes up with a “real” safety strategy, you can always start again later. That approach is very unlikely to work, because somebody else will build whatever you would have… but it’s probably strictly better in terms of mean-time-before-disaster than coming up with rationalizations for going ahead.
Another alternative plan might be to quit worrying about it, so you’re happier.
I find it unhelpful.
… which is how I feel about the original paper we’re talking about. I read it as an attempt to feel more comfortable about a situation that’s intrinsically uncomfortable, because it’s intrinsically dangerous, maybe in an intrinsically unsolvable way. If comfort is the goal, then I guess it’s helpful, but if being right is the goal, then it’s unhelpful. If the comfort took the pressure off of somebody who might otherwise come up with a more effective safety approach, then it would be actively harmful… although I admit that I don’t see a whole lot of hope for that anyway.