You have a section titled “learning user preferences for corrigibility isn’t enough for corrigible behavior”.
Would this be more consistently titled “Learning narrow preferences for corrigibility isn’t enough for corrigible behavior”?