I commented on a portion of a copy of your power-seeking writeup.
I like the current doc a lot. That said, I feel it doesn't take into account some big formal hints and insights we've gotten from my work over the past two years.[1]
Very recently, I was able to show the following strong result:
Some researchers have speculated that capable reinforcement learning (RL) agents are often incentivized to seek resources and power in pursuit of their objectives. While seeking power in order to optimize a misspecified objective, agents might be incentivized to behave in undesirable ways, including rationally preventing deactivation and correction. Others have voiced skepticism: human power-seeking instincts seem idiosyncratic, and these urges need not be present in RL agents. We formalize a notion of power within the context of Markov decision processes (MDPs). We prove sufficient conditions for when optimal policies tend to seek power over the environment. For most prior beliefs one might have about the agent’s reward function, one should expect optimal agents to seek power by keeping a range of options available and, when the discount rate is sufficiently close to 1, by preferentially retaining access to more terminal states. In particular, these strategies are optimal for most reward functions.
I’d be interested in discussing this more with you; the result isn’t publicly available yet, but I’d be happy to take time to explain it. The result tells us significant things about generic optimal behavior in the finite MDP setting, and I think it’s worth deciphering these hints as best we can to apply them to more realistic training regimes.
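As a rough numerical illustration of the "keeping options open" claim (this is my own toy construction, not the formal setup of the theorem): in the small MDP below, one action leads to a single terminal state while the other preserves access to two, and with state rewards drawn IID uniform, value iteration finds the option-preserving action optimal for roughly two-thirds of sampled reward functions.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# States: 0 = start, 1 = dead end (terminal), 2 = choice, 3 = terminal A, 4 = terminal B.
# Deterministic transitions indexed by (state, action); terminal states self-loop.
next_state = np.array([
    [1, 2],  # start: action 0 -> dead end, action 1 -> choice state
    [1, 1],  # dead end
    [3, 4],  # choice: action 0 -> terminal A, action 1 -> terminal B
    [3, 3],  # terminal A
    [4, 4],  # terminal B
])

def optimal_actions_at_start(rewards, iters=1500):
    """Batched value iteration; `rewards` has shape (num_samples, 5).
    Returns the optimal action at the start state for each sampled reward function."""
    V = np.zeros_like(rewards)
    for _ in range(iters):
        Q = rewards[:, :, None] + gamma * V[:, next_state]  # shape (n, 5, 2)
        V = Q.max(axis=2)
    return Q[:, 0, :].argmax(axis=1)

rewards = rng.uniform(size=(10_000, 5))  # IID uniform reward on each state
frac = (optimal_actions_at_start(rewards) == 1).mean()
print(f"Option-preserving action optimal for {frac:.1%} of sampled reward functions")
```

The roughly-two-thirds figure is just the symmetry argument showing through: the maximum of two IID uniform terminal-state rewards beats a third one about two-thirds of the time.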
For example, if you train a generally intelligent RL agent’s reward predictor network on a curriculum of small tasks, my theorems prove that these small tasks are probably best solved by seeking power within the task. This is an obvious way the agent might notice the power abstraction, become strategically aware, and start pursuing power instrumentally.
Another angle to consider: how do power-seeking tendencies mirror other convergent phenomena, like convergent evolution and universal features, and how does this inform our expectations about power-seeking agent cognition? As in the MDP case, I suspect similar symmetry-based considerations are at play.
For example, consider a DNN being trained on a computer vision task. These networks often learn edge detectors in early layers (as also occurs in the visual cortex). Fix the network architecture, data distribution, loss function, and the edge-detector weights. Now consider a range of possible label distributions; for each, consider the completion of the network that minimizes expected loss. (You could also consider the expected output of some learning algorithm, given that it fine-tunes the edge-detector network for some number of steps.) I predict that for a wide range of "reasonable" label distributions, these edge detectors promote effective loss minimization more than if their weights had been set randomly. In this sense, having edge detectors is "empowering."
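Here is a rough sketch of how that comparison might be run. The ResNet-18 stand-in, the choice to freeze only the first conv layer as the "edge detector", and the fixed fine-tuning budget are all illustrative assumptions on my part, not a worked-out experiment:

```python
import torch
import torch.nn as nn
import torchvision

def make_model(pretrained_early: bool) -> nn.Module:
    """Fresh ResNet-18 whose first conv layer is either copied from a pretrained model
    (playing the role of 'edge detector' features) or left at random initialization;
    that layer is frozen in both conditions."""
    model = torchvision.models.resnet18(weights=None, num_classes=10)
    if pretrained_early:
        ref = torchvision.models.resnet18(
            weights=torchvision.models.ResNet18_Weights.DEFAULT)
        model.conv1.load_state_dict(ref.conv1.state_dict())
    for p in model.conv1.parameters():
        p.requires_grad = False
    return model

def finetune(model: nn.Module, loader, steps: int = 500, lr: float = 1e-3) -> float:
    """Train the unfrozen layers for a fixed step budget; return the recent average loss."""
    opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    losses, it = [], iter(loader)
    model.train()
    for _ in range(steps):
        try:
            x, y = next(it)
        except StopIteration:  # restart the loader when an epoch ends
            it = iter(loader)
            x, y = next(it)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return sum(losses[-50:]) / 50

# One would then sweep over several "reasonable" label distributions (e.g. random
# relabelings of a 10-class dataset) and compare finetune(make_model(True), loader)
# against finetune(make_model(False), loader) for each relabeling.
```

Freezing conv1 in both conditions means the only difference between runs is whether the early features are "edge detectors" or random; everything downstream is trained identically under each candidate label distribution.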
And the reason this might crop up is that having these early features is a good idea under many possible symmetries of the label distribution. My thoughts here are still very rough, but I do feel the analysis would benefit from highlighting parallels to convergent evolution and similar phenomena.
(This second angle is closer to conceptual alignment research than it is to weighing existing work, but I figured I’d mention it.)
Great writeup, and I’d love to chat some time if you’re available.
[1] My work has revolved around formally understanding power-seeking in a simple setting (finite MDPs) so as to inform analyses like these. Public posts include:
Seeking Power is Often Robustly Instrumental in MDPs
The Catastrophic Convergence Conjecture
Review of ‘Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More’
Generalizing Power to multi-agent games
Non-Obstruction: A Simple Concept Motivating Corrigibility
Thanks for reading, and for your comments on the doc. I replied to specific comments there, but at a high level: the formal work you’ve been doing on this does seem helpful and relevant (thanks for doing it!). And other convergent phenomena seem like helpful analogs to have in mind.