I think that we have different pictures of what outer alignment scheme we’re considering. In the context of something like value learning, myopia would be a big capabilities hit, and what you’re suggesting might be better. In the context of amplification, however, myopia actually helps capabilities. For example, consider a pure supervised amplification model—i.e. I train the model to approximate a human consulting the model. In that case, a non-myopic model will try to produce outputs which make the human easier to predict in the future, which might not look very competent (e.g. output a blank string so the model only has to predict the human rather than predicting itself as well). On the other hand, if the model is properly myopic such that it is actually just trying to match the human as closely as possible, then you actually get an approximation of HCH, which is likely to be a lot more capable. That being said, unless you have a myopia guarantee like the one above, a competitive model might be deceptively myopic rather than actually myopic.
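To make the "pure supervised amplification" setup concrete, here is a minimal sketch (my own illustrative placeholder, not anything from the comment above) of a myopic training step: the model is trained to imitate a human who may consult the current frozen model, and the loss only scores the match to the human's answers on this batch, with no term rewarding the model for making the human easier to predict later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

QUESTION_DIM, NUM_ANSWERS = 16, 4
model = nn.Linear(QUESTION_DIM, NUM_ANSWERS)        # toy stand-in for the model M
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def human_with_model_access(questions, frozen_model):
    # Placeholder for H consulting the model while answering; a real setup
    # would have an actual human (or a proxy) querying frozen_model here.
    return questions.argmax(dim=1) % NUM_ANSWERS

def myopic_amplification_step(questions):
    # Targets come from the human consulting a frozen copy of the model.
    with torch.no_grad():
        targets = human_with_model_access(questions, model)
    # Myopic objective: only match the human's current answers; no gradient
    # or reward flows through the human's future predictability.
    loss = F.cross_entropy(model(questions), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for _ in range(10):
    myopic_amplification_step(torch.randn(32, QUESTION_DIM))
```

The point of the sketch is just that the loss at each step is a function of the current imitation error alone, which is the sense of "myopic" being relied on; whether the trained model is actually myopic in that sense, rather than deceptively so, is the remaining worry.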
I like this reply. I think there’s something subtle going on with the meaning of “myopic” here, and I’m going to try to think about it more.