Agreed that the previous expectations don’t hold up. I’m worried about adversarial examples tuned to make less adversarially robust reasoners cooperate with a smarter one; such adversarial examples could include things like a barrage of natural-language assertions that everything is fine, exploiting goal misgeneralization on the part of the victims.
In other words, I expect the same kinds of attacks that are used to get humans to follow and obey the commands of a leader who doesn’t have their interests at heart.