I’d especially recommend reading footnote 3, because it gave me an important observation about why instrumental convergence is actually bad for capabilities, or at least not obviously good for capabilities and incentivized, especially when there is a lot of space to roam:
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don’t know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment: the expected result of using RL in a sufficiently unconstrained environment often ranges from “complete failure” to “insane useless crap.” It’s notable that some of the strongest RL agents are built on a foundation of noninstrumental world models.