Which part of the post you link rules this out? As far as I can tell, you’re saying that a few factors influence the decisions of the maze-solving agent, which isn’t incompatible with the agent acting optimally with respect to some reward function that produces training-reward-optimal behaviour on the training set.
We think the complex influence of spatial distances on the network’s decision-making might favor a ‘shard-like’ description: a description of the network’s decisions as coalitions among heuristic submodules whose voting power varies with context. While this is still an underdeveloped hypothesis, it’s motivated by two lines of thinking.
First, we weakly suspect that the agent may be systematically dynamically inconsistent from a utility-theoretic perspective. That is, the effects of d_step(mouse, cheese) and (potentially) d_Euclidean(cheese, top-right) might turn out to call for a behavioural model where the agent’s priorities in a given maze change based on the agent’s current location.
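To make the dynamic-inconsistency idea concrete, here is a toy, purely illustrative sketch (the weights, the exponential decay, and the `prefers_cheese` helper are all our own assumptions, not anything measured from the network). The point is only structural: if the cheese’s “vote” decays with d_step(mouse, cheese), the agent’s ranking of the same two destinations in the same maze can flip as it moves.

```python
import math

def prefers_cheese(d_step_mouse_cheese: float, d_euclidean_cheese_tr: float) -> bool:
    """Toy shard-style vote: does the cheese 'coalition' outvote top-right?

    The cheese vote weakens as the mouse gets farther (in step-distance)
    from the cheese; the top-right vote depends only on fixed maze
    geometry. All coefficients are made up for illustration.
    """
    cheese_vote = math.exp(-0.3 * d_step_mouse_cheese)
    top_right_vote = 0.3 + 0.05 * d_euclidean_cheese_tr
    return cheese_vote > top_right_vote

# Same maze (fixed cheese-to-top-right distance), different mouse positions:
near = prefers_cheese(d_step_mouse_cheese=2, d_euclidean_cheese_tr=4)   # True
far = prefers_cheese(d_step_mouse_cheese=15, d_euclidean_cheese_tr=4)   # False
```

Under a fixed location-independent utility over outcomes, the preference between the two destinations couldn’t flip like this within one maze; that flip is what we mean by dynamic inconsistency here.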
Second, we suspect that if the agent is dynamically consistent, a shard-like description may allow for a more compact and natural statement of an otherwise very gerrymandered-sounding utility function that fixes the value of cheese and top-right in a maze based on a “strange” mixture of maze properties. It may be helpful to look at these properties in terms of similarities to the historical activation conditions of different submodules that favor different plans.
While we consider our evidence suggestive in these directions, it’s possible that some simple but clever utility function will turn out to be predictively successful. For example, consider our two strongly observed effects: d_Euclidean(cheese, top-right) and d_step(decision-square, cheese). We might explain these effects by stipulating that:
On each turn, the agent receives value inversely proportional to its distance from the top-right,
Sharing a square with the cheese adds constant value,
The agent doesn’t know that getting to the cheese ends the game early, and
The agent time-discounts.
We’re somewhat skeptical that models of this kind will hold up once you crunch the numbers and look at scenario predictions, but they deserve a fair shot.
We hope to revisit these questions rigorously when our mechanistic understanding of the network has matured.
In addition to my other comment, I’ll further quote Behavioural statistics for a maze-solving agent: