Yes, I think this is right. It’s been pointed out elsewhere that feature universality in neural networks could be an instance of instrumental convergence, for example. And if you think about it, to the extent that a “correct” model of the universe exists, then capturing that world-model in your reasoning should be instrumentally useful for most non-trivial terminal goals.
We’ve focused on simple gridworlds here, partly because they’re visual and partly because they’re tractable. Still, I suspect there’s a mapping between POWER (in the RL context) and generalizability of features in NNs (in the context of something like the circuits work linked above). This would be really interesting to investigate.