Agreed that the previous expectations don’t hold up. I’m worried about adversarial examples tuned to make less adversarially robust reasoners cooperate with a smarter one; such adversarial examples could include things like a barrage of natural-language assertions that everything is fine, exploiting goal misgeneralization on the part of the victims.
In other words, I expect the same kinds of attacks that are used to get humans to follow and obey the commands of a leader who doesn’t have their interests at heart.