In particular, the main point of this proposal is that it does not require any mastery of ethical philosophy, just the rough knowledge LLMs already have of what humans tend to mean by what they say. I see this as more of a hacky stopgap than an elegant solution.
I think maybe I sound naive phrasing it as “the AGI should just do what we say”, as though I’ve wandered in off the street and am proposing a “why not just...” alignment solution. I’m familiar with a lot of the arguments both for why corrigibility is impossible, and for why it’s maybe not that hard. I believe Paul Christiano’s use of corrigibility is similar to what I mean.
A better term than “just do what I tell it” is “do what I mean and check”. I’ve tried to describe this in Corrigibility or DWIM is an attractive primary goal for AGI. Checking, or clarifying when it’s uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function. But adding an explicit goal of checking when consequences are large or it’s uncertain about intent is another pragmatic, relatively hard-coded measure (at least in my vision of language model agent alignment) that reduces the chance of the agent acting on its galaxy-brained extrapolation of what you meant.
Whether we can implement this depends heavily on what sort of AGI is actually built first. I think that’s likely to be some variant of a language model cognitive architecture, and I think we can rather easily implement it there. This isn’t a certain alignment solution; I propose more layers here. This isn’t the elegant, provable solution we’d like, but it seems to have a better chance of working than any other actual proposal I’ve seen.
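As a rough sketch of the kind of hard-coded check I have in mind, in Python, for a hypothetical language model agent loop (the `llm` and `user` objects, their methods, and the thresholds are all illustrative assumptions, not an existing API):

```python
# A rough, illustrative sketch only: "do what I mean and check" as an explicit
# gate in a language model agent loop. `llm` and `user` are hypothetical objects;
# the thresholds are placeholders, not tuned values.

IMPACT_THRESHOLD = 0.3        # above this, always check before acting
UNCERTAINTY_THRESHOLD = 0.2   # above this, the agent isn't sure enough what was meant

def dwim_and_check(llm, user, instruction: str):
    # Ask the model for its best reading of the instruction, plus self-estimates
    # of how consequential the plan is and how uncertain it is about intent.
    reading = llm.interpret(instruction)  # hypothetical call returning a dict

    large_consequences = reading["impact"] > IMPACT_THRESHOLD
    unclear_intent = reading["intent_uncertainty"] > UNCERTAINTY_THRESHOLD

    # The hard-coded part: when either condition holds, check with the human
    # instead of acting on the agent's extrapolation of what they meant.
    if large_consequences or unclear_intent:
        if not user.confirm(f"I read your request as: {reading['summary']}. Proceed?"):
            return None  # defer to the human
    return reading["plan"]
```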
I think maybe I sound naive phrasing it as “the AGI should just do what we say”, as though I’ve wandered in off the street and am proposing a “why not just...” alignment solution
Nah, I recall your takes tend to be considerably more reasonable than that.
I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don’t agree that “rough knowledge of what humans tend to mean” is sufficient.
The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven’t figured out yet.
These structures might end up immediately relevant to whatever command we give, on the AI’s better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.
For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea: it may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.
The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command’s complexity.
Checking, or clarifying when it’s uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function
That doesn’t work, though, if taken literally? I think what you’re envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that’d work.
My money’s on our understanding of what we mean by “what we mean” being hopelessly confused, and that causing problems. Unless, again, we’ve figured out how to specify it in a mathematically precise manner – unless we know we’re not confused.
I think acausal attacks are kind of a galaxy-brained example; I have a better one.
Imagine that you are training a superintelligent programmer. It writes code; you evaluate it and analyse the code for vulnerabilities. The reward is calculated based on quality metrics, including the number of vulnerabilities. At some point your model becomes sufficiently smart to notice that you don’t see all the vulnerabilities, because you are not a superintelligence. I.e., at some point the ground-truth objective of the training process becomes “produce code with vulnerabilities that only a superintelligence can notice” instead of “produce code with no vulnerabilities”, because you see the code, think “wow, such good code with no vulnerabilities”, and assign maximum reward, while the code is actually filled with them.
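A toy sketch of that reward setup (every name and number here is made up for illustration, not a real training pipeline):

```python
# Toy sketch of the failure mode described above: the reward only penalizes
# vulnerabilities the (non-superintelligent) reviewer can actually detect,
# so sufficiently subtle vulnerabilities cost the model nothing.
from dataclasses import dataclass

REVIEWER_SKILL = 0.7           # illustrative: how subtle a bug the human reviewer can catch
PENALTY_PER_VULNERABILITY = 1.0

@dataclass
class Vulnerability:
    subtlety: float  # 0 = obvious, 1 = only a superintelligence would notice

def reviewer_detects(v: Vulnerability) -> bool:
    # The human reviewer only catches vulnerabilities up to their skill level.
    return v.subtlety <= REVIEWER_SKILL

def reward(code_quality: float, vulnerabilities: list[Vulnerability]) -> float:
    # Intended objective: penalize *all* vulnerabilities.
    # Actual training signal: penalize only the ones the reviewer sees.
    visible = [v for v in vulnerabilities if reviewer_detects(v)]
    return code_quality - PENALTY_PER_VULNERABILITY * len(visible)

# Two submissions get identical reward, even though the second is riddled
# with reviewer-invisible holes.
clean = reward(0.9, [])
subtly_broken = reward(0.9, [Vulnerability(subtlety=0.95)] * 5)
assert clean == subtly_broken
```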
I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend, at which point the agent’s method of checking will diverge from our intended methods of checking.
quetzal_rainbow’s example is one case of that phenomenon.
I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend
Agreed; I did gesture at that in the footnote.
I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human “means” actually still has to route through defining how we want to handle moral philosophy/value extrapolation.
E. g., suppose the AGI’s operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?
Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
Should it extrapolate the human’s values and execute the command the way the human would have wanted to execute it if they’d thought about it a lot, rather than the way they’re envisioning it in the moment?
For example, perhaps the image flashing through the human’s mind right now is of helicopters literally spraying the cure, but it’s actually more efficient to do it using airplanes.
Should it extrapolate the human’s values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
Should it extrapolate the human’s values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
Should it extrapolate the human’s values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
Should it extrapolate the human’s values a lot, interpret the command as “maximize eudaimonia”, and go do that, disregarding the specific way of how they gestured at the idea?
Should it remind the human that they’d wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
Etc.
There are quite a lot of different ways to slice the idea. There’s probably a way that corresponds to the intuitive meaning of “do what I mean”, but maybe there isn’t, and in any case we don’t yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what “DWIM” means doesn’t solve anything.)
And then, because of the general “unknown-unknown environmental structures” plus “compounding errors” problems, picking the wrong definition probably kills everyone.
I agree with all of that.