inappropriate powerseeking: seeking to empower the AI over empowering others, in order to hit adversarial peaks of the reward model, etc.; i.e., “you asked for collaborative powerseeking and instead got deceptive alignment due to an adversarial hole in your model”. Some recent papers that try to formalize this in terms of RL (a toy sketch follows the list):
Power-seeking can be probable and predictive for trained agents (https://arxiv.org/pdf/2304.06528.pdf); see also the LessWrong post
Parametrically Retargetable Decision-Makers Tend To Seek Power (https://arxiv.org/pdf/2206.13477.pdf); see also the LessWrong post
The Boundaries sequence is also relevant to this.
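To make “formalize this in terms of RL” concrete, here is a minimal sketch of the kind of quantity this line of work studies: Turner et al.’s POWER, roughly the average optimal value an agent could get from a state over a distribution of reward functions, so states that keep more options open score higher. The toy MDP, the uniform reward distribution, and all the names below are illustrative assumptions, not details from the papers.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, iters=300):
    # Optimal state values V*(s) = R(s) + gamma * max_a V*(T[s, a])
    # for a deterministic MDP given as a next-state table T[s, a] -> s'.
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        V = (R[:, None] + gamma * V[T]).max(axis=1)
    return V

def power(s, T, gamma=0.9, n_samples=2000, seed=0):
    # POWER_D(s, gamma) ~= ((1 - gamma) / gamma) * E_{R ~ D}[V*_R(s) - R(s)],
    # estimated by sampling reward functions with iid Uniform[0, 1] state rewards.
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_samples):
        R = rng.uniform(0.0, 1.0, size=T.shape[0])
        gaps.append(value_iteration(T, R, gamma)[s] - R[s])
    return (1 - gamma) / gamma * np.mean(gaps)

# Toy deterministic MDP: state 0 can move to state 1 or state 2,
# both of which are absorbing. State 0 keeps more options open,
# so it comes out with higher POWER (~2/3 vs ~1/2 here).
T = np.array([
    [1, 2],  # state 0: action 0 -> state 1, action 1 -> state 2
    [1, 1],  # state 1: absorbing
    [2, 2],  # state 2: absorbing
])
print("POWER(0) =", power(0, T))
print("POWER(1) =", power(1, T))
```

The point of the sketch is just the intuition the papers make precise: for a wide range of reward distributions, states (and policies) that preserve optionality dominate, which is what makes powerseeking probable rather than a corner case.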
(also, having adversarial holes in behavior makes the OSGT (open-source game theory) branch of concern look like “smart model reads the weights of your vulnerable model and pwns it” rather than any kind of intentional, agentic cooperation.)