Here is my guess on how shard theory would affect the argument in this post:
In my understanding, shard theory would predict that the model learns multiple goals from the training-compatible (TC) set (e.g. including both the coin goal and the go-right goal in CoinRun), and may pursue different learned goals in different new situations. The simplifying assumption that the model pursues a randomly chosen goal from the TC set also covers this case, so this doesn’t affect the argument.
Shard theory might also imply that the training-compatible set should be larger, e.g. including goals for which the agent's behavior is not optimal. I don't think this affects the argument either, since we just need the TC set to satisfy the condition that permuting reward values over the out-of-distribution states S_ood produces a reward vector that is still in the TC set.
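To make that permutation condition concrete, here is a minimal sketch. The state indices, the reward vector, and the toy in_tc membership test are made-up stand-ins for illustration, not the post's actual definitions; the only thing the sketch is meant to show is the shape of the check the argument relies on.

```python
import itertools
import numpy as np

# Toy setup: 5 states, of which indices 2-4 are treated as the
# out-of-distribution states S_ood. All values here are made up.
S_OOD = [2, 3, 4]
reward = np.array([1.0, 0.5, 0.0, 2.0, -1.0])  # a candidate reward vector

def permute_ood(r, perm, ood=S_OOD):
    """Return a copy of r with its values on the OOD states permuted."""
    r_new = r.copy()
    r_new[ood] = r[[ood[i] for i in perm]]
    return r_new

def closed_under_ood_permutations(r, in_tc):
    """Check the condition: every OOD permutation of r stays in the TC set."""
    return all(
        in_tc(permute_ood(r, perm))
        for perm in itertools.permutations(range(len(S_OOD)))
    )

# Toy membership test: only the on-distribution rewards (indices 0-1)
# are pinned down by training, so OOD permutations trivially preserve it.
in_tc = lambda r: np.allclose(r[:2], reward[:2])
print(closed_under_ood_permutations(reward, in_tc))  # True
```

The point is just that however shard theory enlarges the TC set, the argument only needs the set to pass a check like closed_under_ood_permutations above.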
So I think that assuming shard theory in this post would lead to the same conclusions. I'd be curious if you disagree.
I still expect instrumental convergence from agentic systems with shard-encoded goals, but I think this post doesn't offer any valid argument for that conclusion.
I don’t think these results cover the shard case. I don’t think reward functions are good ways of describing goals in settings I care about. I also think that realistic goal pursuit need not look like “maximize time-discounted sum of a scalar quantity of world state.”
My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems.