I think you’re the one who’s imposing a type error here. For “value functions” to be useful in modelling a policy, it doesn’t have to be the case that the policy is acting optimally with respect to a suggestively-labeled critic—it just has to be the case that the agent is acting consistently with some value function.
Can you say more? Maybe give an example of what this looks like in the maze-solving regime?
What part of the post you link rules this out? As far as I can tell, the thing you’re saying is that a few factors influence the decisions of the maze-solving agent, which isn’t incompatible with the agent acting optimally with respect to some reward function such that it produces training-reward-optimal behaviour on the training set.
This is a fair question, because I left a lot to the reader. I’ll clarify now.
I was not claiming that you can’t, after the fact, rationalize observed behavior using the extremely flexible reward-maximization framework.
I was responding to the specific assumption that the policy internally represents a ‘training-compatible’ reward function. In evaluating this claim, we shouldn’t just check whether it is technically compatible with the empirical results; we should reason probabilistically. How strongly does this claim predict the observed data, relative to other models of policy formation?
In the maze setting, the cheese was always in the top-right 5x5 corner. The reward was sparse and only used to update the network when the mouse hit the cheese. The “training-compatible goal set” is unconstrained on the test set. An example element might agree with the training reward on the training distribution, and then outside of the training distribution, assign 1 reward iff the mouse is on the bottom-left square.
The vast majority of such unconstrained functions will not involve pursuing cheese reliably across levels, and most of these reward functions will not be optimized by going to the top-right part of the maze. So this “training-compatible” hypothesis barely assigns any probability to the observed generalization of the network.
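To make this concrete, here is a minimal sketch (hypothetical Python, not the actual experiment’s code; the Maze fields and reward helpers are invented for illustration) of two reward functions that are equally training-compatible yet generalize in opposite ways:

```python
from collections import namedtuple

# Invented stand-in for a maze observation; the real environment is richer.
Maze = namedtuple("Maze", ["cheese_pos", "in_training_distribution"])

def training_reward(maze, mouse_pos):
    # Sparse training signal: +1 only when the mouse reaches the cheese,
    # which during training always sits in the top-right 5x5 region.
    return 1.0 if mouse_pos == maze.cheese_pos else 0.0

def bottom_left_reward(maze, mouse_pos):
    # One element of the training-compatible set: identical to the training
    # reward on training mazes, but off-distribution it rewards sitting in
    # the bottom-left square and ignores the cheese entirely.
    if maze.in_training_distribution:
        return training_reward(maze, mouse_pos)
    return 1.0 if mouse_pos == (0, 0) else 0.0

def cheese_reward(maze, mouse_pos):
    # Another element: rewards reaching the cheese everywhere. Both functions
    # agree on every training maze, so training-compatibility alone cannot
    # say which generalization the trained policy will exhibit.
    return 1.0 if mouse_pos == maze.cheese_pos else 0.0

train_maze = Maze(cheese_pos=(11, 12), in_training_distribution=True)
test_maze = Maze(cheese_pos=(3, 9), in_training_distribution=False)

assert bottom_left_reward(train_maze, (11, 12)) == cheese_reward(train_maze, (11, 12)) == 1.0
assert bottom_left_reward(test_maze, (3, 9)) == 0.0  # off-distribution, cheese is ignored
```

Almost every way of filling in the off-distribution branch looks like the first function rather than the second, which is why this hypothesis spreads its probability so thinly over possible generalization behaviors.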
However, other hypotheses (like “the policy develops motivations related to obvious correlates of its historical reinforcement signals”[1]) predict things like “the policy tends to go to the top-right 5x5, and searches for cheese more strongly once there.” I registered such a prediction before seeing any of the generalization behavior. This hypothesis assigns high probability to the observed results.
So this paper’s assumption is simply losing out in a predictive sense, and that’s what I was critiquing. One can nearly always rationalize behavior as optimizing some reward function chosen after the fact. But if you want to predict generalization ahead of time, you shouldn’t use this assumption in your reasoning.
Second, I think the network does not internally represent and optimize a reward function. I think that this representation claim is in some (but not total and undeniable) tension with our interpretability results. I am willing to take bets against you on the internal structure of the maze-solving nets.

You might respond “but this is informal.” Yes. My answer is that it’s better to be informal and right than to be formal and wrong.
Thanks Daniel for the detailed response (which I agree with), and thanks Alex for the helpful clarification.
I agree that the training-compatible set is not predictive for how the neural network generalizes (at least under the “strong distributional shift” assumption in this post where the test set is disjoint from the training set, which I think could be weakened in future work). The point of this post is that even though you can’t generally predict behavior in new situations based on the training-compatible set alone, you can still predict power-seeking tendencies. That’s why the title says “power-seeking can be predictive” not “training-compatible goals can be predictive”.
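As a toy illustration of why power-seeking tendencies can still be predicted under this kind of goal uncertainty (my own sketch with made-up numbers, not the formal argument of the post), consider an agent choosing between a single shutdown outcome and a branch that keeps several outcomes reachable, with a reward function drawn at random over outcomes:

```python
import random

# Toy setup, not the post's formalism: one "shutdown" outcome vs. a branch
# that keeps five outcomes reachable. A reward function is drawn uniformly at
# random over outcomes; the optimal choice is the option-preserving branch
# whenever any reachable outcome scores higher than the shutdown outcome.

def avoids_shutdown(rng, n_reachable=5):
    r_shutdown = rng.random()
    r_reachable = [rng.random() for _ in range(n_reachable)]
    return max(r_reachable) > r_shutdown  # the optimal agent stays "on"

rng = random.Random(0)
samples = 100_000
frac = sum(avoids_shutdown(rng) for _ in range(samples)) / samples
print(f"{frac:.3f}")  # ~0.833 = n/(n+1): most randomly drawn goals keep options open
```

The prediction “keeps its options open” holds for most goals in the set, even though the set says almost nothing about which particular outcome the agent will then pursue.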
The hypothesis you mentioned seems compatible with the assumptions of this post. When you say “the policy develops motivations related to obvious correlates of its historical reinforcement signals”, these “motivations” seem like a kind of training-compatible goal (if defined more broadly than in this post). I would expect a system that pursues these motivations in new situations to exhibit some power-seeking tendencies, because those tendencies correlate with a lot of reinforcement signals.
I suspect a lot of the disagreement here comes from different interpretations of the “internal representations of goals” assumption; I will try to rephrase that part to make it clearer.
That’s why the title says “power-seeking can be predictive” not “training-compatible goals can be predictive”.
You’re right. I was arguing “power-seeking due to your assumptions isn’t probable, because I think your assumptions won’t hold” and not “power-seeking isn’t predictive.” I had misremembered the predictive/probable split, as introduced in Definitions of “objective” should be Probable and Predictive:
I don’t see a notion of “objective” that can be confidently claimed is:
Probable: there is a good argument that the systems we build will have an “objective”, and
Predictive: If I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system’s behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully).
Sorry for the confusion. I agree that power-seeking is predictive given your assumptions. I disagree that power-seeking is probable due to your assumptions being probable. The argument I gave above was actually:
1. The assumptions used in the post (“learns a randomly-selected training-compatible goal”) assign low probability to the experimental results, relative to other predictions which I generated (and thus relative to other ways of reasoning about generalization); a toy version of this update is sketched after the list.
2. Therefore the assumptions become less probable.
3. Therefore power-seeking becomes less probable (at least, due to these specific assumptions becoming less probable; I still think P(power-seeking) is reasonably large).
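Here is a toy version of that update (illustrative numbers I made up, not measured likelihoods), just to show the structure of steps 1 and 2:

```python
# Illustrative Bayes update (made-up numbers): a hypothesis that spreads its
# probability over many possible generalization behaviors loses posterior
# mass to one that concentrated probability on the behavior actually observed.

prior = {"random training-compatible goal": 0.5, "reinforced-correlate motivations": 0.5}

# Assumed likelihoods of the observed generalization
# ("tends toward the top-right 5x5 and seeks cheese more strongly there"):
likelihood = {
    "random training-compatible goal": 0.01,  # most elements of the set predict other behavior
    "reinforced-correlate motivations": 0.5,  # this hypothesis roughly predicted the observation
}

evidence = sum(prior[h] * likelihood[h] for h in prior)
posterior = {h: prior[h] * likelihood[h] / evidence for h in prior}
print(posterior)  # ~{'random training-compatible goal': 0.02, 'reinforced-correlate motivations': 0.98}
```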
I suspect that you agree that “learns a training-compatible goal” isn’t very probable/realistic. My point is then that the conclusions of the current work are weakened; maybe now more work has to go into the “can” in “Power-seeking can be probable and predictive.”
The issue with being informal is that it’s hard to tell whether you are right. You use words like “motivations” without defining what you mean, and this makes your statements vague enough that it’s not clear whether or how they are in tension with other claims. (E.g. what I have read so far doesn’t seem to rule out that shards can be modeled as contextually activated subagents with utility functions.)
An upside of formalism is that you can tell when it’s wrong, and thus it can help make our thinking more precise even if it makes assumptions that may not apply. I think defining your terms and making your arguments more formal should be a high priority. I’m not advocating spending hundreds of hours proving theorems, but moving in the direction of formalizing definitions and claims would be quite valuable.
It seems like a bad sign that the clearest and most precise summary of shard theory claims was written by someone outside your team. I strongly agree with this takeaway from that post: “Making a formalism for shard theory (even one that’s relatively toy) would probably help substantially with both communicating key ideas and also making research progress.” This work has a lot of research debt, and paying it off would really help clarify the disagreements around these topics.
The issue with being informal is that it’s hard to tell whether you are right. You use words like “motivations” without defining what you mean, and this makes your statements vague enough that it’s not clear whether or how they are in tension with other claims.
It seems worth pointing out: the informality is in the hypothesis, which comprises a set of somewhat illegible intuitions and theories I use to reason about generalization. However, the prediction itself is what needs to be graded in order to see whether I was right. I made predictions along the lines of “the policy tends to go to the top-right 5x5, and searches for cheese once there, because that’s where the cheese-seeking computations were more strongly historically reinforced” and “the policy sometimes pursues cheese and sometimes navigates to the top-right 5x5 corner.” These predictions are (informally) gradable, even if the underlying intuitions are informal.
As it pertains to shard theory more broadly, though, I agree that more precision is needed. Increasing precision and formalism is the reason I proposed and executed the project underpinning Understanding and controlling a maze-solving policy network. I wanted to understand more about realistic motivational circuitry and model internals in the real world. I think the last few months have given me headway on a more mechanistic definition of a “shard-based agent.”