To be fair, the post sort of makes this mistake by talking about “internal representations”, but I think everything goes through if you strike out that talk.
I’m responding to this post, so why should I strike that out?
The utility function formalism doesn’t require agents to “internally represent a scalar function over observations”. You’ll notice that this isn’t one of the conclusions of the VNM theorem.
The post is talking about internal representations.
The core claim of this post is that if you train a network in some environment, the agent will not generalize optimally with respect to the reward function you trained it on, but will instead be optimal with respect to some other reward function in a way that is compatible with training-reward-optimality, and that this means that it is likely to avoid shutdown in new environments. The idea that this happens because reward functions are “internally represented” isn’t necessary for those results. You’re right that the post uses the phrase “internal representation” once at the start, and some very weak form of “representation” is presumably necessary for the policy to be optimal for a reward function (at least in the sense that you can derive a bunch of facts about a reward function from the optimal policy for that reward function), but that doesn’t mean that internal representations are central to the post.
Thanks Daniel, this is a great summary. I agree that internal representation of the reward function is not load-bearing for the claim. The weak form of representation that you mentioned is what I was trying to point at. I will rephrase the sentence to clarify this, e.g. something like “We assume that the agent learns a goal during the training process: some form of implicit internal representation of desired state features or concepts”.
Great, this sounds much better!
The internal representations assumption was meant to be pretty broad. I didn’t mean that the network is explicitly representing a scalar reward function over observations or anything like that; these can be implicit representations of state features, for example. I think this would also include the kind of representations you are assuming in the maze-solving post, e.g. cheese shards / circuits.