I think outer and inner alignment both go against the known or suspected grain of cognitive tendencies, whereas shard theory stories do not. I think that outer and inner alignment decompose a hard problem into two extremely hard problems, whereas shard theory aims to address a hard problem more naturally. I will have a post out in the next week or two that is more specific, but I wanted to flag that I very much disagree with this quote.
Since the original draft I realized your position is “outer/inner alignment is a broken frame with mismatched type signatures which is much less likely to work than people think”, so this seems reasonable from your perspective. I haven’t thought much about this document and might end up agreeing with you, so the version I believe is something like “it’s not clear that my shard theory decomposition is substantially easier than inner+outer alignment, assuming that inner+outer alignment is as valid as Evan thinks it is”.
Agree that I’m not being concrete about how corrigibility would be implemented. Concreteness is a virtue and it seems good to think about this in more detail eventually.