By default, I expect mesa-optimizers to use proxy objectives that are simple, fast, and easy to specify in terms of their input data (e.g., pain), rather than ones that require extremely complex world models even to specify (e.g., spread of DNA).
But those simpler proxy objectives wouldn’t let the model do as well as caring about the “loss” RAM location (if the training data is diverse enough), so if you kept training the model and had sufficient compute, wouldn’t you eventually produce a model that used the latter kind of proxy objective?
It seems quite plausible that something else would happen first, though, such as the production of a deceptively aligned mesa-optimizer. Is that what you’d expect? Also, I’m wondering what an actually aligned mesa-optimizer would look like when using SL (supervised learning) to train a general-purpose question-answerer, and I’d be interested in your thoughts on that if you have any. (For example, is it a utility maximizer, and if so, what does its utility function look like?)