Are you referring to this post? I hadn’t read that, thanks for pointing me in that direction. I think technically my subtitle is still correct, because the way I defined priors in the footnotes covers any part of the training procedure that biases the model toward some hypotheses over others. So if the training procedure is likely to be hijacked by “greedy genes”, it wouldn’t count as having an “accurate prior”.
I like the learning theory perspective because it allows us to mostly ignore optimization procedures, making it easier to think about things. This perspective works nicely until the outer optimization process can be manipulated by the hypothesis itself. After reading John’s post, I think I leaned too hard on the learning theory perspective.
I didn’t have much to say about deception because I considered it a straightforward extension of inner misalignment, but I think I was wrong; the “optimization demon” perspective is a good way to think about it.
Thanks for reading!