Maybe there’s been a lot of non-public work that I’m not privy to?
In Aug 2020 I gave formalizing corrigibility another shot, and got something interesting but wrong out the other end. Am planning to publish sometime, but beyond that I’m not aware of other attempts.
When I visited MIRI for a MIRI/CHAI social in 2018, I seriously suggested a break-out group in which we would figure out corrigibility (or the desirable property pointed at by corrigibility-adjacent intuitions) in two hours. I think more people should try this exact exercise more often—including myself.
Yeah, we’ve also spent a while (maybe ~5 hours total?) in various CHAI meetings (some of which you’ve attended) trying to figure out the various definitions of corrigibility to no avail, but those notes are obviously not public. :(
That being said, I don’t think failing in several hours of meetings/a few unpublished attempts is that much evidence of the difficulty?
I just remembered (!) that I have more public writing disentangling various forms of corrigibility and their benefits: “Non-obstruction: A simple concept motivating corrigibility.”