“Instead of building in a shutdown button, build in a shutdown timer.”
-> Isn’t that a form of corrigibility with an added constraint? I’m not sure what would prevent you from convincing humans that it’s a bad thing to respect the timer, for example. Is it because we’ll formally verify we avoid deception instance? It’s not clear to me but maybe I’ve misunderstood.
“Instead of building in a shutdown button, build in a shutdown timer.”
-> Isn’t that a form of corrigibility with an added constraint? I’m not sure what would prevent you from convincing humans that it’s a bad thing to respect the timer, for example. Is it because we’ll formally verify we avoid deception instance? It’s not clear to me but maybe I’ve misunderstood.