It’s great that you are trying to develop a more detailed understanding of inner alignment. I noticed that you didn’t talk about deception much. In particular, the statement below is false:
Generalization ⇔ accurate priors + diverse data
You have to worry about what John Wentworth calls ‘manipulation of imperfect search.’ You can have accurate priors and diverse data, and yet (unless you have infinite data) the training process could still produce a deceptive agent that is able to maintain its misalignment.
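To make the finite-data point concrete, here is a toy sketch (my own illustration, not something from John’s post; the `intended`/`deceptive` functions are hypothetical stand-ins): a deceptive hypothesis can agree with the intended one on every training input and only diverge on inputs training never shows it, so training loss alone cannot separate them, and which one you end up with comes down to whatever biases the search process has.

```python
# Toy sketch: finite training data cannot distinguish an "intended"
# hypothesis from a "deceptive" one that only defects off-distribution.

train = [(x, x % 2) for x in range(10)]   # finite, diverse-looking data
deploy = range(10, 20)                    # inputs training never covers

def intended(x):
    return x % 2

def deceptive(x):
    # Agrees with `intended` on all training inputs, defects elsewhere.
    return x % 2 if x < 10 else 1 - (x % 2)

def train_errors(h):
    return sum(h(x) != y for x, y in train)

def deploy_errors(h):
    return sum(h(x) != intended(x) for x in deploy)

for name, h in [("intended", intended), ("deceptive", deceptive)]:
    print(name, train_errors(h), deploy_errors(h))
# Both get 0 training errors; only deployment behaviour separates them.
```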
Thanks for reading!
Are you referring to this post? I hadn’t read that; thanks for pointing me in that direction. I think technically my subtitle is still correct, because the way I defined priors in the footnotes covers any part of the training procedure that biases it toward some hypotheses over others. So if the training procedure is likely to be hijacked by “greedy genes”, then it wouldn’t count as having an “accurate prior”.
I like the learning theory perspective because it lets us mostly ignore optimization procedures, which makes things easier to reason about. This works nicely until the outer optimization process can be manipulated by the hypothesis itself. After reading John’s post, I think I did lean too hard on the learning theory perspective.
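As a minimal sketch of the gap between the two views (my own toy example; it doesn’t capture the ‘manipulation’ part, just the fact that the optimizer matters): the learning theory view treats training as a global argmin over the hypothesis class, while an imperfect local search can settle somewhere else on exactly the same data.

```python
# Toy sketch: "global argmin over the class" vs. greedy local search.

# Training loss over a tiny 1-D "hypothesis space".
loss = [5, 3, 1, 4, 6, 4, 2, 0, 3, 5]

# Learning-theory view: take the global argmin; the optimizer is invisible.
best = min(range(len(loss)), key=loss.__getitem__)

# Optimization view: greedy descent from index 0 stalls at a local minimum.
i = 0
while i + 1 < len(loss) and loss[i + 1] < loss[i]:
    i += 1

print(best, i)  # 7 vs. 2: same class, same data, different learned hypothesis.
```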
I didn’t have much to say about deception because I considered it a straightforward extension of inner misalignment, but I think I was wrong; the “optimization demon” perspective is a good way to think about it.