Note that ARC evals haven’t done anything I would describe as “try to investigate misalignment in the lab.” They’ve asked AI systems to carry out tasks to understand what they are capable of, e.g. can they make copies of themselves or carry out a targeted phishing attack.
However, I also think “create misalignment in the lab” is super important for doing real evaluation of takeover risks. I think the risks are small and the cost-benefit analysis is a slam dunk. I think it’s great to actually have the cost-benefit discussion (e.g. by considering concrete ways in which an experimental setup increases risk and evaluating them), but a knee-jerk reaction of “misalignment = bad” would be unproductive. Misalignment in the lab often won’t increase risk at all (or will have a truly negligible effect) while being more-or-less necessary for any institutionally credible alignment strategy.
There are coherent worldviews where I can see the cost-benefit coming out against, but they involve having a very low total level of risk (such that this problem is unlikely to appear unless you deliberately create it) together with a very high level of civilizational vulnerability (such that a small amount of misaligned AI in the lab can cause a global catastrophe). Or, maybe more realistically, just a claim that studying alignment in the lab has a very small probability of helping mitigate that risk.
Nitpick: When you fine-tune a model to try to carry out phishing attacks, how is that not deliberately creating a misaligned model? Because it’s doing what you wanted it to do? Well, in that case it’s impossible to deliberately create a misaligned model, no?
Yeah, I would say a model that carries out test phishing attacks (when asked to do so) is not misaligned in the relevant sense. I don’t think “refuses to carry out phishing attacks when asked” is part of the definition of alignment. (Also note that fine-tuning models to be more willing to carry out phishing isn’t part of the evaluations ARC has done so far; the models are just jailbroken in the normal way for the small number of tasks that require it.)
I think relevant forms of “misalignment in the lab” are more like: we created a situation where the model does what the (simulated) human wants during training, but then abruptly switches to overpowering the simulated human once it has the opportunity to do so. Or we created a situation where a model really wants to do X, is fine-tuned to do Y, and then abruptly switches to pursuing X once it thinks it is no longer being trained. Or we created an AI that deliberately tries to answer questions dishonestly and tested whether we were able to catch it. And so on.