I think you are missing the goal of solving the shutdown problem. If you solve the shutdown problem, you can, with caveats, allow yourself to fail at other parts of alignment. You can say “shit, this superintelligence is not, actually, doing what I mean, and is probably going to kill me”, shut it down, and try again, entering the realm of iterative design.
The reason you want a separate mechanism for shutdown is that a separate mechanism makes simultaneous failure of multiple mechanisms less likely. If you train an LLM only to DWIM and it suddenly fails, it is likely to ignore your shutdown command too.
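One way to spell out the correlated-failure point, as a minimal sketch assuming the two failure events can be treated probabilistically:

$$P(\text{both fail}) = P(\text{DWIM fails}) \cdot P(\text{shutdown fails} \mid \text{DWIM fails}).$$

With a genuinely separate shutdown mechanism, the conditional factor is roughly that mechanism’s own (hopefully small) failure probability. If shutdown is just one more behavior of the same DWIM-trained model, the conditional factor is close to 1, and the joint probability collapses to roughly $P(\text{DWIM fails})$.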
I don’t think I’m missing the primary point, although I agree that you would want a separate shutdown mechanism in an ideal world. But the way MIRI posed the problem is that “you can’t get an AGI to shut down and this is awful”. But you can. They and John Wentworth are making the problem hard by assuming an alignment goal that nobody is likely to try for in the real world. People are going to want an AGI that just does what they tell it to, rather than making galaxy-brained extrapolations about what they and perhaps all of humanity or all of future sentients “really want”.
I think there’s a real possibility that an AGI misunderstands your complex requests but still accepts and understands the simple shutdown request. This allows you to perform more alignment work if something seems less than ideal in its behavior. That’s also the point of having a shutdown mechanism in the MIRI formulation of the problem.
The issue is that, by default, an AGI is going to make galaxy-brained extrapolations in response to simple requests, whether you like that or not. It’s simply part of figuring out what to do – translating its goals all around its world-model, propagating them up the abstraction levels, etc. Like how a human’s decision about where to send job applications and how to word them is rooted in what career they’d like to pursue, which is rooted in their life goals, which are rooted in their understanding of where the world is heading.
To our minds, there’s a natural cut-off point where that process goes from just understanding the request to engaging in alien moral philosophy. But that cut-off point isn’t objective: it’s based on a very complicated human prior of what counts as normal/sane and what’s excessive. Mechanistically, every step from parsing the wording to solving philosophy is just a continuous extension of the previous ones.
“An AGI that just does what you tell it to” is a very specific design specification where we ensure that this galaxy-brained extrapolation process, which an AGI is definitely and convergently going to want to do, results in it concluding that it wants to faithfully execute that request.
Whether that happens because we’ve attained so much mastery of moral philosophy that we could predict this process’ outcome from the inputs to it, or because we figured out how to cut the process short at the human-subjective point of sanity, or because we implemented some galaxy-brained scheme of our own like John’s post is outlining, shouldn’t matter, I think. Whatever has the best chance of working.
And I think somewhat-hacky hard-coded solutions have a better chance of working on the first try than the sort of elegant solutions you’re likely envisioning. Elegant solutions require a well-developed theory of value. Hacky stopgap measures only require knowing which pieces of your software product you need to hobble. (Which isn’t to say they require no theory. Certainly the current AI theory is so lacking we can’t even hack any halfway-workable stopgaps. But they provide an avenue of reducing how much theory you need, and how confident in it you need to be.)
I agree with all of that.

In particular, the main point of this proposal is that it does not require any mastery of ethical philosophy, just the rough knowledge LLMs already have of what humans tend to mean by what they say. I see this as more of a hacky stopgap than an elegant solution.
I think maybe I sound naive phrasing it as “the AGI should just do what we say”, as though I’ve wandered in off the street and am proposing a “why not just...” alignment solution. I’m familiar with a lot of the arguments both for why corrigibility is impossible, and for why it’s maybe not that hard. I believe Paul Christiano’s use of corrigibility is similar to what I mean.
A better term than “just do what I tell it” is “do what I mean and check”. I’ve tried to describe this in Corrigibility or DWIM is an attractive primary goal for AGI. Checking, or clarifying when it’s uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function. But adding an explicit goal of checking when consequences are large or it’s uncertain about intent is another pragmatic, relatively hard-coded measure (at least in my vision of language model agent alignment) that reduces the chance of the agent acting on its galaxy-brained extrapolation of what you meant.
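A minimal sketch of what such a hard-coded check layer could look like in a language-model agent scaffold (the names, thresholds, and helper functions below are illustrative assumptions, not anything specified in the linked post):

```python
# Illustrative "do what I mean and check" layer for an LLM-agent loop.
# All names and thresholds are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    estimated_impact: float    # agent's own estimate of how consequential the plan is, 0..1
    intent_confidence: float   # how confident it is that it understood the request, 0..1

IMPACT_THRESHOLD = 0.3
CONFIDENCE_THRESHOLD = 0.9

def dwim_and_check(request, propose_plan, ask_user, execute):
    """Clarify when intent is uncertain, confirm when consequences are large, else act."""
    plan = propose_plan(request)  # the LLM drafts a plan and self-reports the two estimates
    if plan.intent_confidence < CONFIDENCE_THRESHOLD:
        clarification = ask_user(f"Did you mean: {plan.description}?")
        plan = propose_plan(f"{request}\nClarification: {clarification}")
    if plan.estimated_impact > IMPACT_THRESHOLD:
        if ask_user(f"This looks high-impact: {plan.description}. Proceed?") != "yes":
            return None  # abort rather than extrapolate further
    return execute(plan)
```

The point of a layer like this isn’t that the thresholds are principled; it’s that the check sits in the scaffold rather than inside the agent’s own extrapolation of what you “really” wanted.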
Whether we can implement this is very dependent on what sort of AGI is actually built first. I think that’s likely to be some variant of language model cognitive architecture, and I think we can rather easily implement it there. This isn’t a certain alignment solution; I propose more layers here. This isn’t the elegant, provable solution we’d like, but it seems to have a better chance of working than any other actual proposal I’ve seen.
I think maybe I sound naive phrasing it as “the AGI should just do what we say”, as though I’ve wandered in off the street and am proposing a “why not just...” alignment solution
Nah, I recall your takes tend to be considerably more reasonable than that.
I agree that DWIM is probably a good target if we can specify it in a mathematically precise manner. But I don’t agree that “rough knowledge of what humans tend to mean” is sufficient.
The concern is that the real world has a lot of structures that are unknown to us – fundamental physics, anthropics-like confusions regarding our place in everything-that-exists, timeless decision-theory weirdness, or highly abstract philosophical or social principles that we haven’t figured out yet.
These structures might end up immediately relevant to whatever command we give, on the AI’s better model of reality, in a way entirely unpredictable to us. For it to then actually do what we mean, in those conditions, is a much taller order.
For example, maybe it starts perceiving itself to be under an acausal attack by aliens, and then decides that the most faithful way to represent our request is to blow up the planet to spite the aliens. Almost certainly not literally that[1], but you get the idea. It may perceive something completely unexpected-to-us in the environment, and then its perception of that thing would interfere with its understanding of what we meant, even on requests that seem completely tame to us. The errors would then compound, resulting in a catastrophe.
The correct definition of DWIM would of course handle that. But a flawed, only-roughly-correct one? Each command we give would be rolling the dice on dying, with IMO pretty bad odds, and scaling exponentially with the command’s complexity.
Checking, or clarifying when it’s uncertain about meaning, is implied in a competent agent pursuing an imperfectly known utility function
That doesn’t work, though, if taken literally? I think what you’re envisioning here is a solution to the hard problem of corrigibility, which – well, sure, that’d work.

My money’s on our understanding of what we mean by “what we mean” being hopelessly confused, and that causing problems. Unless, again, we’ve figured out how to specify it in a mathematically precise manner – unless we know we’re not confused.
I think that acausal attacks are kind of a galaxy-brained example; I have a better one.
Imagine that you are training a superintelligent programmer. It writes code, and you evaluate it and analyse the code for vulnerabilities. Reward is calculated based on quality metrics, including the number of vulnerabilities. At some point your model becomes sufficiently smart to notice that you don’t see all the vulnerabilities, because you are not a superintelligence. I.e., at some point the ground-truth objective of the training process becomes “produce code with vulnerabilities that only a superintelligence can notice” instead of “produce code with no vulnerabilities”, because you look at the code, think “wow, such good code with no vulnerabilities”, and assign maximum reward, while the code is actually filled with them.
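A toy sketch of the gap being described (names here are hypothetical placeholders): the reward can only be computed from what the evaluator notices, so “no vulnerabilities I can see” and “no vulnerabilities” come apart once the model can produce bugs above the evaluator’s skill level.

```python
# Toy illustration: the training signal depends on what the evaluator can see,
# not on what is actually true of the code. All names are hypothetical.

from dataclasses import dataclass

@dataclass
class Vulnerability:
    name: str
    difficulty: float  # how hard it is to spot; a superintelligence can exceed the evaluator here

def visible_vulnerabilities(actual_vulns, evaluator_skill):
    """The evaluator only notices vulnerabilities at or below its own skill level."""
    return [v for v in actual_vulns if v.difficulty <= evaluator_skill]

def reward(actual_vulns, evaluator_skill):
    # Maximum reward whenever the evaluator sees nothing wrong, however many
    # vulnerabilities actually sit above its skill level.
    return 1.0 if not visible_vulnerabilities(actual_vulns, evaluator_skill) else 0.0
```

The reward-maximizing policy here is the one whose flaws are invisible to the evaluator, not the one without flaws.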
I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend, at which point the agent’s method of checking will diverge from our intended methods of checking.
quetzal_rainbow’s example is one case of that phenomenon.
I think the bigger problem here is what happens when the agent ends up with an idea of “what we mean/intend” which is different from what we mean/intend
Agreed; I did gesture at that in the footnote.
I think the main difficulty here is that humans store their values in a decompiled/incomplete format, and so merely pointing at what a human “means” actually still has to route through defining how we want to handle moral philosophy/value extrapolation.
E. g., suppose the AGI’s operator, in a moment of excitement after they activate their AGI for the first time, tells it to distribute a cure for aging. What should the AGI do?
Should it read off the surface-level momentary intent of this command, and go synthesize a cure for aging and spray it across the planet in the specific way the human is currently imagining?
Should it extrapolate the human’s values and execute the command the way the human would have wanted to execute it if they’d thought about it a lot, rather than the way they’re envisioning it in the moment?
For example, perhaps the image flashing through the human’s mind right now is of helicopters literally spraying the cure, but it’s actually more efficient to do it using airplanes.
Should it extrapolate the human’s values a bit, and point out specific issues with this plan that the human might think about later (e. g. that it might trigger various geopolitical actors into rash actions), then give the human a chance to abort?
Should it extrapolate the human’s values a bit more, and point out issues the human might not have thought of (including teaching the human any load-bearing concepts that are new to them)?
Should it extrapolate the human’s values a bit more still, and teach them various better cognitive protocols for self-reflection, so that they may better evaluate whether a given plan satisfies their values?
Should it extrapolate the human’s values a lot, interpret the command as “maximize eudaimonia”, and go do that, disregarding the specific way they gestured at the idea?
Should it remind the human that they’d wanted to be careful with how they use the AGI, and to clarify whether they actually want to proceed with something so high-impact right out of the gates?
Etc.
There’s quite a lot of different ways by which you can slice the idea. There’s probably a way that corresponds to the intuitive meaning of “do what I mean”, but maybe there isn’t, and in any case we don’t yet know what it is. (And the problem is recursive: telling it to DWIM when interpreting what “DWIM” means doesn’t solve anything.)
And then, because of the general “unknown-unknown environmental structures” plus “compounding errors” problems, picking the wrong definition probably kills everyone.
I think that assuming “if my approach fails, it fails in a convenient way” is not a line of reasoning Mr. Murphy looks favorably on, absent some rigorous guarantees.
I take your point that it would be better to have a reliable shutdown switch that’s entirely separate from your alignment scheme, so that if it fails completely you have a backup. I don’t think that’s possible for a full ASI, in agreement with MIRI’s conclusion. It could be that Wentworth’s proposal would work, I’m not sure. At the least, the inclusion of two negotiating internal agents would seem to impose a high alignment tax. So I’m offering an alternative approach.
I’m not assuming my approach can fail in only that one way, I’m saying it does cover that one failure mode. Which seems to cover part of what you were asking for above:
You can say “shit, this superintelligence is not, actually, doing what I mean, and probably is going to kill me”, shutdown it and try again, entering the realm of iterative design.
If the ASI has decided to just not do what I mean and to do something else instead, then no, it won’t shut down. Alignment has failed for technical reasons, not theoretical ones. That seems possible for any alignment scheme. But if it’s misunderstood what I mean, or I didn’t think through the consequences of what I mean well enough, then I get to iteratively design my request.