TurnTrout comments on Power-seeking can be probable and predictive for trained agents

TurnTrout 5 Jun 2023 21:39 UTC
LW: 2 AF: 2
0
AF
To be fair, the post sort of makes this mistake by talking about “internal representations”, but I think everything goes thru if you strike out that talk.
I’m responding to this post, so why should I strike that out?
The utility function formalism doesn’t require agents to “internally represent a scalar function over observations”. You’ll notice that this isn’t one of the conclusions of the VNM theorem.
The post is talking about internal representations.
- DanielFilan 6 Jun 2023 19:17 UTC
  LW: 4 AF: 3
  2
  AF Parent
  The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are “internally represented” isn’t necessary for those results. You’re right that the post uses the phrase “internal representation” once at the start, and some very weak form of “representation” is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn’t mean that they’re central to the post.
  - Vika 6 Jun 2023 19:27 UTC
    LW: 6 AF: 5
    0
    AF Parent
    Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like “We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts”.
    - TurnTrout 12 Jun 2023 18:43 UTC
      LW: 4 AF: 3
      0
      AF Parent
      Great, this sounds much better!
- Vika 6 Jun 2023 14:45 UTC
  LW: 2 AF: 1
  0
  AF Parent
  The internal representations assumption was meant to be pretty broad. I didn’t mean that the network is explicitly representing a scalar reward function over observations or anything like that; e.g., these can be implicit representations of state features. I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits.