why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?
This sounds confused to me: the intelligence is the “qualitatively new thought processes”. The thought processes aren’t some new regime that intelligence has to generalize to. Also, to answer the question directly: I think the claim is that intelligence (which I’d say is synonymous, for these purposes, with capability) is simpler and more natural than corrigibility (i.e., the last claim—I don’t think these three claims are to be taken separately).
We know that any behavior is compatible with a utility function
People keep saying this, but it seems false to me. I’ve seen the construction for history-based utility functions that’s supposed to show it, and I don’t find it compelling—it doesn’t seem to engage with what EY is getting at with “coherent planning behavior”. Is there an analogous construction for (environment-)state-based utility functions? I’m not saying that’s exactly the right formalism for demonstrating the relationship between coherent behavior and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread on the topic seems pretty unresolved to me.) For reference, I sketch the history-based construction I have in mind below.
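For concreteness, here’s a sketch of the history-based construction I have in mind (my paraphrase; the version others cite may differ in details). Take any policy $\pi$ that maps each partial history of actions and observations to an action, and define a utility function over complete histories $h = (a_1, o_1, \ldots, a_T, o_T)$ by

$$U(h) = \begin{cases} 1 & \text{if } a_t = \pi(a_1, o_1, \ldots, a_{t-1}, o_{t-1}) \text{ for all } t, \\ 0 & \text{otherwise.} \end{cases}$$

By construction $\pi$ maximizes expected $U$, so any behavior whatsoever is “compatible with” some history-based utility function. But this $U$ is just a re-encoding of the policy itself and places no constraint on behavior, which is part of why it doesn’t seem to me to touch the notion of coherent planning behavior.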