DanielFilan comments on Power-seeking can be probable and predictive for trained agents

DanielFilan 6 Jun 2023 19:17 UTC
LW: 4 AF: 3
2
The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on. Instead, it will be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are “internally represented” isn’t necessary for those results. You’re right that the post uses the phrase “internal representation” once at the start, and some very weak form of “representation” is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn’t mean that internal representations are central to the post.
- Vika 6 Jun 2023 19:27 UTC
  LW: 6 AF: 5
  0
  Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like “We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts”.
  - TurnTrout 12 Jun 2023 18:43 UTC
    LW: 4 AF: 3
    0
    Great, this sounds much better!