(Below is my response to my best understanding of your reply – let me know if you were trying to make a different point)
It can be simultaneously true that ideal intent-aligned reasoners could just execute the expected-best policy, that overcoming bias generally involves assessing how well your algorithm performs in a given situation, and that it's profitable to think about that aspect explicitly with respect to corrigibility. So I think I agree with you, but I'm interested in which heuristics corrigible reasoning might tend to use?