I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend, at which point the agent’s method of checking will diverge from our intended methods of checking.
quetzal_rainbow’s example is one case of that phenomenon.
I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend
Agreed; I did gesture at that in the footnote.
I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human “means” actually still has to route through defining how we want to handle moral philosophy/value extrapolation.
E.g., suppose the AGI’s operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?
Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
Should it extrapolate the human’s values and execute the command the way the human would have wanted to execute it if they’d thought about it a lot, rather than the way they’re envisioning it in the moment?
For example, perhaps the image flashing through the human’s mind right now is of helicopters literally spraying the cure, but it’s actually more efficient to do it using airplanes.
Should it extrapolate the human’s values a bit, and point out specific issues with this plan that the human might think about later (e.g., that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
Should it extrapolate the human’s values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
Should it extrapolate the human’s values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
Should it extrapolate the human’s values a lot, interpret the command as “maximize eudaimonia”, and go do that, disregarding the specific way they gestured at the idea?
Should it remind the human that they’d wanted to be careful with how they use the AGI, and ask whether they actually want to proceed with something so high-impact right out of the gate?
Etc.
There are quite a lot of different ways to slice the idea. There’s probably a way that corresponds to the intuitive meaning of “do what I mean”, but maybe there isn’t, and in any case we don’t yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what “DWIM” means doesn’t solve anything.)
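To make that spectrum concrete, here’s a toy sketch (the names and function are purely illustrative, not a proposal) of the free parameter hiding inside “DWIM”:

```python
# Toy illustration: a DWIM policy has to pick an extrapolation depth,
# and nothing in the command itself says which one. All names here are
# hypothetical labels for the options listed above.
from enum import Enum, auto


class ExtrapolationDepth(Enum):
    LITERAL_SNAPSHOT = auto()          # execute the operator's momentary mental image
    IDEALIZED_EXECUTION = auto()       # do it the way they'd want after more thought
    FLAG_KNOWN_ISSUES = auto()         # surface concerns they might raise later, then ask
    FLAG_UNKNOWN_ISSUES = auto()       # surface concerns they haven't conceived of yet
    IMPROVE_REFLECTION = auto()        # first upgrade their ability to evaluate plans
    FULL_VALUE_EXTRAPOLATION = auto()  # read the command as "maximize eudaimonia"


def dwim(command: str, depth: ExtrapolationDepth) -> str:
    """Interpret `command` at the chosen extrapolation depth.

    The catch: `depth` is exactly the parameter we don't know how to specify.
    Saying "use the depth I mean" just re-poses the same question one level up.
    """
    return f"interpret {command!r} at depth {depth.name}"


if __name__ == "__main__":
    for depth in ExtrapolationDepth:
        print(dwim("distribute a cure for aging", depth))
```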
And then, because of the general “unknown-unknown environmental structures” plus “compounding errors” problems, picking the wrong definition probably kills everyone.