I still expect instrumental convergence from agentic systems with shard-encoded goals, but think this post doesn’t offer any valid argument for that conclusion.
I don’t think these results cover the shard case. I don’t think reward functions are good ways of describing goals in the settings I care about. I also think that realistic goal pursuit need not look like “maximize the time-discounted sum of some scalar quantity of world state.”
My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems.