I still expect instrumental convergence from agentic systems with shard-encoded goals, but think this post doesn’t offer any valid argument for that conclusion.
I don’t think these results cover the shard case. I don’t think reward functions are good ways of describing goals in the settings I care about. I also think that realistic goal pursuit need not look like “maximize the time-discounted sum of some scalar quantity of world state.”
My point is not that instrumental convergence is wrong, or that shard theory makes different predictions. I just think that these results are not predictive of trained systems.