Check out corrigibility (someone really should write that tag...), starting from this paper.
Corrigibility would render Chris’s idea unnecessary, but that isn’t an argument that Chris’s idea wouldn’t work, unless there’s some argument along the lines of “If you could implement Chris’s idea, you could also implement corrigibility.”
Chris’s idea can be rephrased as a variation of the off button: rather than just turning itself off, the AI runs NOPs, and rather than being pushed manually, the button is triggered by escaping (however that is defined). A lot of the problems raised in the original paper should also apply to honeypots.