The idea of deliberately training misaligned models in order to check whether your techniques work is a great one IMO. Note that this is not a new view; ARC evals and others in the AI safety community (such as myself) have been saying we should do stuff like this for a while, and in some cases actually doing it (e.g. fine-tuning a model to exhibit some dangerous behavior). Of course it’s not something that should be done lightly, but it’s much better than the alternative, in which we have a growing toolbox of alignment & interpretability techniques but we aren’t allowed to test whether they work & we don’t know whether our models are capable enough to need them.
Note that ARC evals haven’t done anything I would describe as “try to investigate misalignment in the lab.” They’ve asked AI systems to carry out tasks to understand what they are capable of, e.g. can they make copies of themselves or carry out a targeted phishing attack.
However I also think “create misalignment in the lab” is super important for doing real evaluation of takeover risks. I think the risks are small and the cost-benefit analysis is a slam dunk. I think it’s great to actually have the cost-benefit discussion (e.g. by considering concrete ways in which an experimental setup increases risk and evaluating them), but that a knee-jerk reaction of “misalignment = bad” would be unproductive. Misalignment in the lab often won’t increase risk at all (or will have a truly negligible effect) while being more-or-less necessary for any institutionally credible alignment strategy.
There are coherent worldviews where I can see the cost-benefit coming out against, but they involve having a very low total level of risk (such that this problem is unlikely to appear unless you deliberately create it) together with a very high level of civilizational vulnerability (such that a small amount of misaligned AI in the lab can cause a global catastrophe). Or maybe more realistically just a claim that studying alignment in the lab has very small probability of helping mitigate it.
Nitpick: When you fine-tune a model to try to carry out phishing attacks, how is that not deliberately creating a misaligned model? Because it’s doing what you wanted it to do? Well in that case it’s impossible to deliberately create a misaligned model, no?
Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense I don’t think “refuses to carry out phishing attacks when asked” is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn’t part of the evaluations ARC has done so far, the models are just jailbroken in the normal way for the small number of tasks that require it.)
I think relevant forms of “misalignment in the lab” are more like: we created a situation where the model does what the (simulated) human wants during training, but then abruptly switches to overpowering the simulated human once it has the opportunity to do so. Or we created a situation where a model really wants to do X, is fine-tuned to do Y, and then abruptly switches to pursuing X once it thinks it is no longer being trained. Or we created an AI that deliberately tries to answer questions dishonestly and tested whether we were able to catch it. Or so on.
Um, what difference are you imagining between this thing and that thing?
Like, I think there is a case that you can do gain of function research safely, but “we tried to make it more capable at a particular bad thing” seems like a pretty central example of gain-of-function research?
S-risks don’t happen by default; misaligned AGI does. The chance of lab escape of misaligned AGI and deliberately-s-risky AGI is the same (let’s assume) but this chance is tiny compared to the benefits (in the case of testing our alignment and interpretability techniques) and huge compared to the benefits (in the case of testing our s-risk-reduction techniques).
Analogy to bio gain of function: The reason why it’s a bad idea to create novel pathogens in a lab is that they probably won’t appear at all if you don’t create them. If we were instead in a world where giant tech companies were busy creating novel pathogens designed specifically to kill as many people as possible, and injecting them into the water supply, then yes, it would make sense to do gain of function research in a desperate attempt to devise countermeasures in the brief window before someone else does it anyway.
Is this your position: there is no acceptable reason to deliberately optimize for s-risky things like sadism. And doing so to red-team s-risk detection is obviously madness. But possibly red-teaming conventional misalignment which would simply kill everyone if the absolute worst happened and is the default anyway maybe makes some sense?
It’s worth noting that you don’t necessarily need to train models to actually do dangerous actions like literally executing on a takeover attempt, you can just train models which do something which is a proxy to coups (or a proxy to some part of coups).
The extent to which this proxy itself dangerous or generalizing in a dangerous way will vary.
Like suppose I want to see if my oversight process discovers when AIs try to insert malign backdoor for the purposes of takeover. I can just train an AI to insert bugs which cause some specific observable effect later (e.g., the code should delete some specific file in some circumstance) and this observable effect doesn’t itself need to be dangerous.
Yeah, I think I remember hearing about ARC doing this a while ago too, and disliked it then, and similarly dislike it now. Suppose they make a misaligned model, and their control systems fail so that the model can spread very far. I expect their unconstrained misaligned model can do far more damage than their constrained possibly aligned ones if able to spread freely on the internet. Probably being an existential risk itself.
Edit: Man, I kinda dislike my past comment. Like listening to or watching a recording of yourself. But I stick with the lab-leak concern.
The idea of deliberately training misaligned models in order to check whether your techniques work is a great one IMO. Note that this is not a new view; ARC evals and others in the AI safety community (such as myself) have been saying we should do stuff like this for a while, and in some cases actually doing it (e.g. fine-tuning a model to exhibit some dangerous behavior). Of course it’s not something that should be done lightly, but it’s much better than the alternative, in which we have a growing toolbox of alignment & interpretability techniques but we aren’t allowed to test whether they work & we don’t know whether our models are capable enough to need them.
Note that ARC evals haven’t done anything I would describe as “try to investigate misalignment in the lab.” They’ve asked AI systems to carry out tasks to understand what they are capable of, e.g. can they make copies of themselves or carry out a targeted phishing attack.
However I also think “create misalignment in the lab” is super important for doing real evaluation of takeover risks. I think the risks are small and the cost-benefit analysis is a slam dunk. I think it’s great to actually have the cost-benefit discussion (e.g. by considering concrete ways in which an experimental setup increases risk and evaluating them), but that a knee-jerk reaction of “misalignment = bad” would be unproductive. Misalignment in the lab often won’t increase risk at all (or will have a truly negligible effect) while being more-or-less necessary for any institutionally credible alignment strategy.
There are coherent worldviews where I can see the cost-benefit coming out against, but they involve having a very low total level of risk (such that this problem is unlikely to appear unless you deliberately create it) together with a very high level of civilizational vulnerability (such that a small amount of misaligned AI in the lab can cause a global catastrophe). Or maybe more realistically just a claim that studying alignment in the lab has very small probability of helping mitigate it.
Nitpick: When you fine-tune a model to try to carry out phishing attacks, how is that not deliberately creating a misaligned model? Because it’s doing what you wanted it to do? Well in that case it’s impossible to deliberately create a misaligned model, no?
Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense I don’t think “refuses to carry out phishing attacks when asked” is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn’t part of the evaluations ARC has done so far, the models are just jailbroken in the normal way for the small number of tasks that require it.)
I think relevant forms of “misalignment in the lab” are more like: we created a situation where the model does what the (simulated) human wants during training, but then abruptly switches to overpowering the simulated human once it has the opportunity to do so. Or we created a situation where a model really wants to do X, is fine-tuned to do Y, and then abruptly switches to pursuing X once it thinks it is no longer being trained. Or we created an AI that deliberately tries to answer questions dishonestly and tested whether we were able to catch it. Or so on.
I do sometimes worry that humanity is dumb enough to create “gain of function for s-risks” and call it alignment research.
I agree, that would be terrible & I wouldn’t endorse that.
Um, what difference are you imagining between this thing and that thing?
Like, I think there is a case that you can do gain of function research safely, but “we tried to make it more capable at a particular bad thing” seems like a pretty central example of gain-of-function research?
S-risks don’t happen by default; misaligned AGI does. The chance of lab escape of misaligned AGI and deliberately-s-risky AGI is the same (let’s assume) but this chance is tiny compared to the benefits (in the case of testing our alignment and interpretability techniques) and huge compared to the benefits (in the case of testing our s-risk-reduction techniques).
Analogy to bio gain of function: The reason why it’s a bad idea to create novel pathogens in a lab is that they probably won’t appear at all if you don’t create them. If we were instead in a world where giant tech companies were busy creating novel pathogens designed specifically to kill as many people as possible, and injecting them into the water supply, then yes, it would make sense to do gain of function research in a desperate attempt to devise countermeasures in the brief window before someone else does it anyway.
Is this your position: there is no acceptable reason to deliberately optimize for s-risky things like sadism. And doing so to red-team s-risk detection is obviously madness. But possibly red-teaming conventional misalignment which would simply kill everyone if the absolute worst happened and is the default anyway maybe makes some sense?
I’m not sure what you are getting at. Maybe? What you said is not quite how I’d put it, but seems similar at least.
It’s worth noting that you don’t necessarily need to train models to actually do dangerous actions like literally executing on a takeover attempt, you can just train models which do something which is a proxy to coups (or a proxy to some part of coups).
The extent to which this proxy itself dangerous or generalizing in a dangerous way will vary.
Like suppose I want to see if my oversight process discovers when AIs try to insert malign backdoor for the purposes of takeover. I can just train an AI to insert bugs which cause some specific observable effect later (e.g., the code should delete some specific file in some circumstance) and this observable effect doesn’t itself need to be dangerous.
Yeah, I think I remember hearing about ARC doing this a while ago too, and disliked it then, and similarly dislike it now. Suppose they make a misaligned model, and their control systems fail so that the model can spread very far. I expect their unconstrained misaligned model can do far more damage than their constrained possibly aligned ones if able to spread freely on the internet. Probably being an existential risk itself.
Edit: Man, I kinda dislike my past comment. Like listening to or watching a recording of yourself. But I stick with the lab-leak concern.