One thing I found confusing about this post and Paul’s post “Worst-case guarantees” (the second link in the OP: https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d): Paul says, “This is the second guarantee from ‘Two guarantees,’ and is basically corrigibility,” whereas you say, “Corrigibility seems to be one of the most promising candidates for such an acceptability condition.” So it seems like you two might have somewhat different ideas about what corrigibility means.
Can you clarify what you think is the relationship?
I don’t think there’s really a disagreement there. I think Paul is saying that he views corrigibility as the right way to get an acceptability guarantee.