After reading some of the newer MIRI dialogues, I’m less convinced than I once was that I know what “corrigibility” actually is. Could you say a few words about what kind of behavior you concretely expect to see from a “corrigible” agent, followed by how [you expect] those behaviors [to] fit into the “trajectory-constraining” framework you propose in your post?
EDIT: This is not purely a question for Steven, incidentally (or at least, the first half isn’t); anyone else who wants to take a shot at answering should feel free to do so. In particular I’d be interested in hearing answers from Eliezer or anyone else historically involved in the invention of the term.
My understanding: a corrigible paperclip-maximizer does all the paperclip-maximizing, but then when you realize it’s gonna end the world, you go to turn it off, and it doesn’t stop you. It’s corrigible!
There are a bunch of different definitions, but if you’re asking for Eliezer’s version, then the Arbital exposition is quite good. N.B. we don’t have a model for this sort of corrigibility.
EDIT: Be warned, these are rough summaries of the defs. I’d amend the CHAI def I cited to “the AI obeys more when it knows less, models you as more rational, and the downsides of disobedience are lesser”. But people at CHAI have diverse views, so this is not the definitive CHAI take.
Other definitions include: the definition held by some people at CHAI (the AI obeys you while it doesn’t know what its utility function is); the definition used in the reward tampering paper (nearly the same as EY’s original def, barring the honesty clause, and formalised in a causal diagram setting); and Stuart Armstrong’s many definitions, most notably Utility Indifference (note the agent is NOT a standard R-maximiser). Under Utility Indifference the agent accepts having its utility function changed at a later time because you’re going to compensate it for its loss in utility, so it is indifferent to the change (this doesn’t mean it won’t kill you for spare parts though). TurnTrout also has what look like some interesting thoughts on the topic here, but I haven’t read those yet.
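To make the Utility Indifference idea a bit more concrete, here is a toy sketch in Python. It is my own illustration under heavy simplifying assumptions, not Armstrong’s actual formalism: the `Action` class, the two example actions, and all the numbers are made up. The point it tries to show is only that, once the agent is compensated for whatever it loses when its utility function is changed, the probability of the change drops out of its expected value, so it has no incentive to block the change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical action with made-up numbers, for illustration only."""
    name: str
    p_change: float          # chance the humans change the utility function
    value_if_kept: float     # expected payoff if the function is left alone
    value_if_changed: float  # expected payoff if the function is changed


def naive_value(a: Action) -> float:
    # Plain expected value: the agent loses out whenever the change happens,
    # so it prefers actions that make the change less likely.
    return (1 - a.p_change) * a.value_if_kept + a.p_change * a.value_if_changed


def indifferent_value(a: Action) -> float:
    # Assumed compensation scheme: on a change, top the agent up by exactly
    # what it lost. The p_change term then cancels out of the expectation.
    compensation = a.value_if_kept - a.value_if_changed
    return ((1 - a.p_change) * a.value_if_kept
            + a.p_change * (a.value_if_changed + compensation))


actions = [
    Action("work normally", p_change=0.5, value_if_kept=10.0, value_if_changed=0.0),
    Action("block the change", p_change=0.0, value_if_kept=9.0, value_if_changed=0.0),
]

naive_choice = max(actions, key=naive_value).name
indifferent_choice = max(actions, key=indifferent_value).name
print(naive_choice)
print(indifferent_choice)
```

With these particular numbers the uncompensated agent prefers to block the change, while the compensated one just works normally. Note that this only removes the incentive to fight the change; it says nothing about whether the original utility function is safe, which is the “kill you for spare parts” caveat above.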
Edit2: Paul thinks corrigibility has a simpler core than alignment, but is quite messy, and we won’t get a crisp algorithm for it. But the intuition is the same as what Eliezer was pointing to, namely that the AI knows it should defer to the human, and will seek to preserve that deference in itself and its offspring. Plus being honest and helpful. Here is a post where he rambles about it.