I think that we have different pictures of what outer alignment scheme we’re considering. In the context of something like value learning, myopia would be a big capabilities hit, and what you’re suggesting might be better. In the context of amplification, however, myopia actually helps capabilities. For example, consider a pure supervised amplification model—i.e. I train the model to approximate a human consulting the model. In that case, a non-myopic model will try to produce outputs which make the human easier to predict in the future, which might not look very competent (e.g. output a blank string so the model only has to predict the human rather than predicting itself as well). On the other hand, if the model is properly myopic such that it is actually just trying to match the human as closely as possible, then you actually get an approximation of HCH, which is likely to be a lot more capable. That being said, unless you have a myopia guarantee like the one above, a competitive model might be deceptively myopic rather than actually myopic.
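To make the "pure supervised amplification" setup concrete, here is a minimal sketch (my own illustrative placeholder, not anything from the comment above) of a myopic training step: the model is trained to imitate a human who may consult the current frozen model, and the loss only scores the match to the human's answers on this batch, with no term rewarding the model for making the human easier to predict later.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

QUESTION_DIM, NUM_ANSWERS = 16, 4
model = nn.Linear(QUESTION_DIM, NUM_ANSWERS)        # toy stand-in for the model M
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def human_with_model_access(questions, frozen_model):
    # Placeholder for H consulting the model while answering; a real setup
    # would have an actual human (or a proxy) querying frozen_model here.
    return questions.argmax(dim=1) % NUM_ANSWERS

def myopic_amplification_step(questions):
    # Targets come from the human consulting a frozen copy of the model.
    with torch.no_grad():
        targets = human_with_model_access(questions, model)
    # Myopic objective: only match the human's current answers; no gradient
    # or reward flows through the human's future predictability.
    loss = F.cross_entropy(model(questions), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for _ in range(10):
    myopic_amplification_step(torch.randn(32, QUESTION_DIM))
```

The point of the sketch is just that the loss at each step is a function of the current imitation error alone, which is the sense of "myopic" being relied on; whether the trained model is actually myopic in that sense, rather than deceptively so, is the remaining worry.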
I like this reply. I think there’s something subtle going on with the meaning of “myopic” here, and I’m going to try to think about it more.