What are the reasons we might want to do adversarial training?
Here’s a rough taxonomy.
Your model has learnt a stupid heuristic instead of the thing you want
It has overfit to whatever particular dataset you trained on. It is best described as ‘trying’ to pursue this particular heuristic, and it appears to be doing well because of some arbitrary correlation in your dataset. For example, it is succeeding at image classification by attending to high-frequency noise in the image. Doing adversarial training is useful because it finds examples where this heuristic is uncorrelated with what you want. This allows you to spot the problem and to improve the model’s alignment. It doesn’t matter if the examples are particularly consequential failures; it just matters that they help the model learn the correct pattern. The adversarial examples you find are a ‘surprise’ to the model.
Your model is deceptively aligned
Your model is fairly well-described as trying to pursue some goal that is unaligned with you. It performs well on your training set because this is instrumentally useful. It plans to defect in particular situations where this is advantageous. Doing adversarial training is useful if you can use it to fool the model into thinking it’s in some situation where it should defect. This allows you to spot that it is deceptive, and maybe train this deception away. It doesn’t matter if the example involves the model causing a particularly bad thing to happen, it just matters that you are able to detect the deception. The adversarial examples you find are not a ‘surprise’ to the model—it was deliberately planning to defect in some situations.
Your model is generally aligned, but an adversary (or some extremely unlikely input) can trigger temporary misaligned behaviour or even permanent misalignment.
For example, an adversarial AI can present your model with some sequence of inputs that cause it to start ‘trying’ to help this AI instead of you. Your model ‘wants’ to avoid this and doesn’t know in advance which situations will induce this. Adversarial training is useful to find these situations and fix these vulnerabilities. It is important to find the most consequential failures—you care less about situations where the model is briefly unaligned but doesn’t do significant harm.
Your model is fully aligned under all circumstances, but sometimes does bad things because of incompetence.
For example, sometimes it crashes a car because of a failure to recognise another car coming towards it. Adversarial training is useful to find cases where this happens. It is important to find the most consequential failures—you care less about situations where the model is briefly incompetent but doesn’t do significant harm. The adversarial examples you find are a ‘surprise’ to the model.
Aside: defining alignment
One confusing edge case for distinguishing alignment failure from capabilities failure is if the model ‘knows’ some concept, but doesn’t ‘know’ that that was the concept you wanted.
E.g. suppose the model classifies someone getting shot with a gun as not an injury. There are 3 possibilities:
The model does not have a reasonable concept of injury that includes getting hit by a bullet, or doesn’t know that getting shot at would cause you to get hit by a bullet
The model has this concept of injury, and understands that it will occur, but ‘thinks’ you only care about injuries involving swords or knives because almost all injuries in your training data involved those
The model understands what you want, but is pursuing some other goal
I think the first possibility is definitely a capabilities failure, and the last is definitely an alignment failure, but I’m not sure how to categorize the middle one. The key test is something like: do we expect this problem to get better or worse as model capability increases? I’m inclined to say that we expect it to get better, because a more capable model will be better at guessing what humans want, and better at asking for clarification if uncertain.
Some thoughts on why adversarial training might be useful
What are the reasons we might want to do adversarial training?
Here’s a rough taxonomy.
Your model has learnt a stupid heuristic instead of the thing you want
It has overfit to whatever particular dataset you trained on. It is best described as ‘trying’ to pursue this particular heuristic, and it appears to be doing well because of some arbitrary correlation in your dataset. For example, it is succeeding at image classification by attending to high-frequency noise in the image. Doing adversarial training is useful because it finds examples where this heuristic is uncorrelated with what you want. This allows you to spot the problem and to improve the model’s alignment. It doesn’t matter if the examples are particularly consequential failures; it just matters that they help the model learn the correct pattern. The adversarial examples you find are a ‘surprise’ to the model.
Your model is deceptively aligned
Your model is fairly well-described as trying to pursue some goal that is unaligned with you. It performs well on your training set because this is instrumentally useful. It plans to defect in particular situations where this is advantageous. Doing adversarial training is useful if you can use it to fool the model into thinking it’s in some situation where it should defect. This allows you to spot that it is deceptive, and maybe train this deception away. It doesn’t matter if the example involves the model causing a particularly bad thing to happen, it just matters that you are able to detect the deception. The adversarial examples you find are not a ‘surprise’ to the model—it was deliberately planning to defect in some situations.
Your model is generally aligned, but an adversary (or some extremely unlikely input) can trigger temporary misaligned behaviour or even permanent misalignment.
For example, an adversarial AI can present your model with some sequence of inputs that cause it to start ‘trying’ to help this AI instead of you. Your model ‘wants’ to avoid this and doesn’t know in advance which situations will induce this. Adversarial training is useful to find these situations and fix these vulnerabilities. It is important to find the most consequential failures—you care less about situations where the model is briefly unaligned but doesn’t do significant harm.
Your model is fully aligned under all circumstances, but sometimes does bad things because of incompetence.
For example, sometimes it crashes a car because of a failure to recognise another car coming towards it. Adversarial training is useful to find cases where this happens. It is important to find the most consequential failures—you care less about situations where the model is briefly incompetent but doesn’t do significant harm. The adversarial examples you find are a ‘surprise’ to the model.
Aside: defining alignment
One confusing edge case for distinguishing alignment failure from capabilities failure is if the model ‘knows’ some concept, but doesn’t ‘know’ that that was the concept you wanted.
E.g. suppose the model classifies someone getting shot with a gun as not an injury. There are 3 possibilities:
The model does not have a reasonable concept of injury that includes getting hit by a bullet, or doesn’t know that getting shot at would cause you to get hit by a bullet
The model has this concept of injury, and understands that it will occur, but ‘thinks’ you only care about injuries involving swords or knives because almost all injuries in your training data involved those
The model understands what you want, but is pursuing some other goal
I think the first possibility is definitely a capabilities failure, and the last is definitely an alignment failure, but I’m not sure how to categorize the middle one. The key test is something like: do we expect this problem to get better or worse as model capability increases? I’m inclined to say that we expect it to get better, because a more capable model will be better at guessing what humans want, and better at asking for clarification if uncertain.