A variant of Dialogic RL with improved corrigibility. Suppose that the AI's prior allows a small probability for "universe W", whose semantics are, roughly speaking, "all my assumptions are wrong, need to shut down immediately". In other words, this is a universe where all our prior shaping is replaced by the single axiom that shutting down has much higher utility than anything else. Moreover, we add to the prior the assumption that the formal question "W?" is understood perfectly by the user even without any annotation. This means that, whenever the AI assigns a higher-than-threshold probability to the user answering "yes" if asked "W?" at any uncorrupt point in the future, the AI will shut down immediately. We should also shape the prior s.t. corrupt futures also favor shutdown: this is reasonable in itself, but it also ensures that the AI won't come to believe that too many futures are corrupt and thereby avoid the imperative to shut down in response to a confirmation of W.
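As a rough illustration, here is a minimal Python sketch of the shutdown rule under one possible reading: the AI holds, for each future time step, a belief about whether that step is uncorrupt and a prediction of the user's answer to "W?" at that step. The data structures, names, and threshold value are hypothetical, not part of the formal Dialogic RL machinery.

```python
# Minimal sketch of the W-triggered shutdown rule (names, structures and
# threshold are assumptions for illustration only).

from dataclasses import dataclass
from typing import List

SHUTDOWN_THRESHOLD = 0.01  # assumed threshold on P(user confirms "W?")


@dataclass
class FuturePointBelief:
    """The AI's current beliefs about one future time step."""
    p_uncorrupt: float  # probability this future point is uncorrupt
    p_confirm_w: float  # probability the user answers "yes" to "W?" if
                        # asked at this point, conditional on it being uncorrupt


def should_shut_down(future_beliefs: List[FuturePointBelief]) -> bool:
    """Shut down iff at some future point believed to be uncorrupt the
    predicted probability of the user confirming W exceeds the threshold."""
    for belief in future_beliefs:
        if belief.p_uncorrupt > 0 and belief.p_confirm_w > SHUTDOWN_THRESHOLD:
            return True
    return False


# Example: the user is expected to confirm W at the third step with
# probability 0.3, so the AI shuts down now.
beliefs = [
    FuturePointBelief(p_uncorrupt=0.99, p_confirm_w=0.001),
    FuturePointBelief(p_uncorrupt=0.95, p_confirm_w=0.002),
    FuturePointBelief(p_uncorrupt=0.90, p_confirm_w=0.30),
]
assert should_shut_down(beliefs)
```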
Now, this won't help if the user only resolves to confirm W after something catastrophic has already occurred, such as the AI releasing malign subagents into the wild. But something of the sort is true for any corrigibility scheme: corrigibility is about allowing the user to make changes to the AI on eir own initiative, which can always come too late. This method doesn't ensure safety by itself; it only hardens a system that is already supposed to be close to safe.
It would be nice if we could replace "shutdown" with "undo everything you did and then shut down", but that gets us into thorny specification issues. Perhaps it's possible to tackle those issues with one of the approaches to "low impact".
Universe W should still be governed by a simplicity prior. This means that whenever the agent detects a salient pattern that contradicts the assumptions of its prior shaping, the probability of W increases, leading to shutdown. This serves as an additional "sanity test" precaution.
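The following toy Bayesian update (with made-up likelihoods and an assumed posterior threshold) illustrates the effect: because W sits inside a simplicity prior, a run of observations that the shaped hypotheses explain poorly shifts posterior mass onto W, and once W carries enough weight its shutdown axiom dominates the agent's decision.

```python
# Toy illustration of the "sanity test" effect. The likelihood values and
# the threshold are assumptions chosen only to show the dynamics.

W_SHUTDOWN_THRESHOLD = 0.5  # assumed posterior mass on W that forces shutdown


def posterior_w(prior_w: float, lik_w: float, lik_shaped: float) -> float:
    """Bayes update for P(W) given one observation.

    lik_w      -- likelihood of the observation under universe W
    lik_shaped -- likelihood under the shaped (non-W) hypotheses combined
    """
    evidence = prior_w * lik_w + (1.0 - prior_w) * lik_shaped
    return prior_w * lik_w / evidence


p_w = 1e-4  # small prior mass on W
# A run of observations that contradict the prior-shaping assumptions:
# far more probable under W than under the shaped hypotheses.
for _ in range(5):
    p_w = posterior_w(p_w, lik_w=0.5, lik_shaped=0.05)

print(p_w)  # ≈ 0.91 after five anomalous observations
shut_down = p_w > W_SHUTDOWN_THRESHOLD  # True: the agent shuts down
```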