My power-seeking theorems seem a bit like Vingean reflection. In Vingean reflection, you reason about an agent which is significantly smarter than you: if I’m playing chess against an opponent who plays the optimal policy for the chess objective function, then I predict that I’ll lose the game. I predict that I’ll lose, even though I can’t predict my opponent’s (optimal) moves—otherwise I’d probably be that good myself.
My power-seeking theorems show that most objectives have optimal policies which e.g. avoid shutdown and survive into the far future, even without saying what particular actions these policies take to get there. I may not even be able to compute a single optimal policy for a single non-trivial objective, but I can still reason about the statistical tendencies of optimal policies.
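A toy sketch of that kind of reasoning (my own illustrative example, not the actual theorem setup): in a tiny deterministic MDP where "staying alive" keeps three terminal states reachable while shutdown leads to a single absorbing state, and reward depends only on the final state, the optimal policy avoids shutdown whenever the best reachable future state beats the shutdown state's reward. For i.i.d. uniform rewards that happens for about 3/4 of objectives, without computing anything about which future state the policy actually picks:

```python
import random

# Hypothetical toy MDP: from the start state the agent either enters an
# absorbing "shutdown" state, or stays alive and then picks one of three
# absorbing future states ("a", "b", "c"). Reward depends only on where
# the agent ends up, so optimality reduces to a max comparison.
random.seed(0)

def avoids_shutdown(reward):
    # The optimal policy avoids shutdown iff some reachable future
    # state is worth more than the shutdown state.
    return max(reward["a"], reward["b"], reward["c"]) > reward["shutdown"]

trials = 100_000
hits = sum(
    avoids_shutdown({s: random.random() for s in ("shutdown", "a", "b", "c")})
    for _ in range(trials)
)
print(hits / trials)  # ~0.75: most sampled objectives avoid shutdown
```

The point of the sketch is only the statistical tendency: we can say that roughly three quarters of random objectives have shutdown-avoiding optimal policies here, even though we never identify which of a, b, or c any particular optimal policy heads for.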
if I’m playing chess against an opponent who plays the optimal policy for the chess objective function
1. I predict that you will never encounter such an opponent. Solving chess is hard.*
2. Optimal play within a game might not be optimal overall (others can learn from the strategy).
Why does this matter? If the theorems hold even for ‘not optimal, but still great’ policies (say, for chess), then the distinction is irrelevant. Though for more complicated (or non-zero-sum) games, the optimal move/policy may depend on the other player’s move/policy.
(I’m not sure what ‘avoid shutdown’ looks like in chess.)
ETA:
*With roughly 10^43 legal positions in chess, computing a perfect strategy would take an impossibly long time with any feasible technology.
Source: https://en.wikipedia.org/wiki/Chess#Mathematics (which cites a source from 1977).