Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
Goal robustness: a corrigible AGI should maintain corrigibility even after N adversarially-chosen gradient steps (the higher the better).
Satiability: a corrigible AGI should be able to pre-specify a rough level of completeness of a given task, and then shut down after reaching that point.
Quick brainstorm:
Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
Goal robustness: a corrigible AGI should maintain corrigibility even after N adversarially-chosen gradient steps (the higher the better).
Satiability: a corrigible AGI should be able to pre-specify a rough level of completeness of a given task, and then shut down after reaching that point.