Two quick thoughts (that don’t engage deeply with this nice post).
I’m worried about cases where the goal is not consistent across situations. For example, a model that, when prompted to pursue some goal, pursues it seriously, including adopting convergent instrumental subgoals along the way.
I think it’s pretty likely that future iterations of transformers will have bits of powerful search in them. But people who seem very worried about that search also seem to think that, once the search is established enough, gradient descent will reorganise the model’s internals mostly around it (I imagine the search circuits “bubbling out” to become the outer structure of the learned algorithm). Probably this is all just conceptually confused, but to the extent it’s not, I’m pretty surprised by that intuition.