X-and-only-X is what I call the issue where the property that’s easy to verify and train is X, but the property you want is “this was optimized for X and only X and doesn’t contain a whole bunch of possible subtle bad Ys that could be hard to detect formulaically from the final output of the system”.
If X is “be a competent, catastrophe-free, corrigible act-based assistant”, it’s plausible to me that an AGI trained to do X is sufficient to lead humanity to a good outcome, even if X doesn’t capture human values. For example, the operator might have the AGI develop the technology for whole brain emulations, enabling human uploads that can solve the safety problem in earnest, after which the original AGI is shut down.
Being an act-based (and thus approval-directed) agent is doing a ton of heavy lifting in this picture. Humans obviously wouldn’t approve of daemons, so your AI would just try really hard to not do that. Humans obviously wouldn’t approve of a Rubik’s cube solution that modulates RAM to send GSM cellphone signals, so your AI would just try really hard to not do that.
I think most of the difficulty here is shoved into training an agent to actually have property X, instead of just some approximation of X. It’s plausible to me that this is actually straightforward, but it also feels plausible that X is a really hard property to impart (though still much easier to impart than “have human values”).
A crux for me on whether property X is sufficient is whether the operator could avoid being accidentally manipulated. (A corrigible assistant would never intentionally manipulate, but if it satisfies property X while more directly optimizing Y, it might accidentally manipulate the humans into pursuing some Y distinct from human values.) I feel very uncertain about this, but it currently seems plausible to me that some operators could successfully just use the assistant to solve the safety problem in earnest, and then shut down the original AGI.
Corrigibility is doing a ton of heavy lifting in this picture. Humans obviously wouldn’t approve of daemons, so your AI would just try really hard to not do that.
I’m a bit confused about how “corrigibility” is being used here. I thought it meant that the agent doesn’t resist correction, but here it seems to be used to mean something more like trying to only do things the overseer would approve of.
I thought we called the latter being “approval-directed” and that it was a separate idea from corrigibility. Am I confused?
Oops, I think I was conflating “corrigible agent” with “benign act-based agent”. You’re right that they’re separate ideas. I edited my original comment accordingly.