inappropriate powerseeking: seeking to empower the AI over empowering others, in order to hit adversarial peaks of the reward model, etc.; i.e., “you asked for collaborative powerseeking and instead got deceptive alignment due to an adversarial hole in your model”. Some recent papers that try to formalize this in terms of RL (a toy sketch follows the list):
Power-seeking can be probable and predictive for trained agents (https://arxiv.org/pdf/2304.06528.pdf); see also the LessWrong post
Parametrically Retargetable Decision-Makers Tend To Seek Power (https://arxiv.org/pdf/2206.13477.pdf); see also the LessWrong post
The Boundaries sequence is also relevant to this.
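To make “formalize this in terms of RL” concrete, here is a minimal sketch of the kind of quantity this line of work studies: Turner et al.’s POWER, roughly the average optimal value an agent could get from a state over a distribution of reward functions, so states that keep more options open score higher. The toy MDP, the uniform reward distribution, and all the names below are illustrative assumptions, not details from the papers.

```python
import numpy as np

def value_iteration(T, R, gamma=0.9, iters=300):
    # Optimal state values V*(s) = R(s) + gamma * max_a V*(T[s, a])
    # for a deterministic MDP given as a next-state table T[s, a] -> s'.
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        V = (R[:, None] + gamma * V[T]).max(axis=1)
    return V

def power(s, T, gamma=0.9, n_samples=2000, seed=0):
    # POWER_D(s, gamma) ~= ((1 - gamma) / gamma) * E_{R ~ D}[V*_R(s) - R(s)],
    # estimated by sampling reward functions with iid Uniform[0, 1] state rewards.
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_samples):
        R = rng.uniform(0.0, 1.0, size=T.shape[0])
        gaps.append(value_iteration(T, R, gamma)[s] - R[s])
    return (1 - gamma) / gamma * np.mean(gaps)

# Toy deterministic MDP: state 0 can move to state 1 or state 2,
# both of which are absorbing. State 0 keeps more options open,
# so it comes out with higher POWER (~2/3 vs ~1/2 here).
T = np.array([
    [1, 2],  # state 0: action 0 -> state 1, action 1 -> state 2
    [1, 1],  # state 1: absorbing
    [2, 2],  # state 2: absorbing
])
print("POWER(0) =", power(0, T))
print("POWER(1) =", power(1, T))
```

The point of the sketch is just the intuition the papers make precise: for a wide range of reward distributions, states (and policies) that preserve optionality dominate, which is what makes powerseeking probable rather than a corner case.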
(also, having adversarial holes in behavior makes the OSGT (open-source game theory) branch of concern look like “smart model reads the weights of your vulnerable model and pwns it” rather than any kind of intentional, agentic cooperation.)