This strikes me as defining “alignment” a little differently than I do.
It might even define “instruction-following” differently than I do.
If we really solved instruction following, you could give the instruction “Do the right thing” and it would just do the right thing.
If that’s possible, then what we need is a coalition to tell powerful AIs to “do the right thing”, rather than “make my creators into god-emperors” or whatever. This seems doable, though the clock is perhaps ticking.
If you can’t just tell an AI to do the right thing, but it’s still competent enough to pull off dangerous plans, then to me this still seems like the usual problem of “powerful AI that’s not trying to do good is bad” whether or not a human is giving instructions to this AI.
Or to rephrase this as a call to action: AI alignment researchers cannot just hill-climb on making AIs that follow arbitrary instructions. We have to preferentially advance AIs that do the right thing, to avoid the sort of scenario you describe.
I actually completely agree with this call to action.
Unfortunately, I suspect that it’s impossible to make value alignment easier than personal intent alignment. I can’t think of a technical alignment approach that couldn’t be used both ways equally well. Worse than that, I think intent-aligned AGI is easier than value-aligned AGI, for reasons I outline in that post and that Max Harms has elaborated in much more detail in his Corrigibility as Singular Target sequence (as have Paul Christiano and many others).
But I still agree with your call to action: we should be working now to make value alignment as safe as possible. That requires deciding what we align to. The concept of humanity is not well-defined in a future where upgrades and digital copies of human minds become possible. Roger Dearnaley’s sequence AI, alignment, and ethics lays out these problems and more; for instance, if we stick to baseline humans, the future will be largely controlled by whatever values are held by the most humans, in a competition for memes and reproduction. So there’s conceptual as well as technical/mind-design work to be done on alignment.
And that work should be done. In multipolar scenarios, someone may well decide to “launch” their AGI to be autonomous with value alignment, out of magnanimity or desperation. We’d better make their odds of success as high as we can manage.
I don’t think refusing to work on intent alignment is a helpful option. It will likely be tried, with or without our help. Following instructions is the most obvious alignment target for any agent that’s even approaching autonomy and therefore usefulness. Thinking about how to make those attempts successful will also increase our odds of surviving the first competent autonomous AGIs.
WRT definitions: alignment doesn’t specify alignment with whom. I think this ambiguity is causing important confusions in the field.
I was trying to draw a distinction between two importantly different alignment goals, which I’m terming personal intent alignment and value alignment until better terminology comes along. More on that in an upcoming post.
If you did have an AGI that follows instructions and you told it “do the right thing”, you’d have to specify right for whom.
And during the critical risk period, that AGI wouldn’t know for sure what the right thing was. We don’t expect godlike intelligence right out of the gate. It won’t know whether a risky takeover/pivotal act is the right move. If the situation is multipolar, it won’t know even as it becomes truly superintelligent, because it will have to guess at the plans, technologies, and capabilities of other superintelligent AGI.
My call to action is this: help me understand and make or break the argument that a multipolar scenario is very bad, so that the people in charge of the first really successful AGI project know the stakes when they make their calls.
The problem is that “do the right thing” makes no sense without a reference to particular values, or more formally to the utility function of the human in question. So there’s no way to do what you propose even in theory, at least without strong assumptions about their values/utility functions.
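To make that concrete, here’s a toy sketch (entirely my own illustration, with made-up actions and utilities): the action that “do the right thing” picks out flips as soon as you swap in a different person’s utility function.

```python
# Toy illustration (hypothetical actions and utilities): "do the right thing"
# reduces to an argmax over *someone's* utility function, so the answer
# changes depending on whose function you plug in.
actions = ["fund_research", "build_monument"]

utility_alice = {"fund_research": 10, "build_monument": 1}
utility_bob   = {"fund_research": 2,  "build_monument": 9}

def right_thing(utility):
    """Return the action that maximizes the given utility function."""
    return max(actions, key=lambda a: utility[a])

print(right_thing(utility_alice))  # fund_research
print(right_thing(utility_bob))    # build_monument
```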
Also, it breaks corrigibility, and in many applications, like military AI, that’s a dangerous property to break, because you’ll probably want to change the AI’s orders/actions. This sort of anti-corrigibility is usually bad unless you’re very confident that value learning works, a confidence I don’t share.
All language makes no sense without a method of interpretation. “Get me some coffee” is a horribly ambiguous instruction that any imagined assistant will have to cope with. How might an AI learn what “get me some coffee” entails without it being hardcoded in?
To say it’s impossible in theory is to set the bar so high that humans using language is also impossible.
As for military use of AGI, I think I’m fine with breaking that application. If we can build AI that does good things when directed to (which can incorporate some parts of corrigibility, like not being overly dogmatic and soliciting a broad swath of human feedback), then we should. If we cannot build AI that actually does good things, we haven’t solved alignment by my lights and building powerful AI is probably bad.
I think the biggest difference I have here is that I don’t think there is much pressure to converge on a single value, or even a small space of values, at least in the multi-agent case, unlike in your communication examples. I think the degrees of freedom for morality are pretty wide, unlike in the case of communication, where even simple RL agents can converge on communication/language norms (at least in the non-adversarial case).
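As a reference point for the communication case, here’s a minimal sketch (my own toy example, not anything from the post) of the kind of convergence I mean: two Roth-Erev-style reinforcement learners playing a 2-state Lewis signaling game reliably settle on one of the two possible shared conventions, without either convention being hardcoded in.

```python
import random

# Toy Lewis signaling game: a sender sees a state and emits a signal; a
# receiver sees only the signal and picks an act; both are reinforced when
# the act matches the state. States/signals/acts are arbitrary labels.
STATES, SIGNALS, ACTS = [0, 1], [0, 1], [0, 1]

# Roth-Erev "urn" weights: start with one unit on every option.
sender_w = {s: {sig: 1.0 for sig in SIGNALS} for s in STATES}
receiver_w = {sig: {a: 1.0 for a in ACTS} for sig in SIGNALS}

def sample(weights):
    options = list(weights)
    return random.choices(options, weights=[weights[o] for o in options])[0]

for _ in range(20_000):
    state = random.choice(STATES)
    signal = sample(sender_w[state])
    act = sample(receiver_w[signal])
    if act == state:  # success: reinforce the choices that led to it
        sender_w[state][signal] += 1.0
        receiver_w[signal][act] += 1.0

# The weights end up concentrated on one of the two signaling systems,
# i.e. a shared "language" emerges from reinforcement alone.
print(sender_w)
print(receiver_w)
```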
At a meta level, I’m more skeptical than you seem to be that value learning, especially the ambitious variant, is a good first target, and I think corrigibility/DWIMAC goals fare better than you think they do, primarily because I think the arguments that alignment dooms us have holes that keep them from going through.
Strong optimization doesn’t need to ignore boundaries and tile the universe with optimal stuff according to its own aesthetics, disregarding the prior content of the universe (such as other people). The aesthetics can be about how the prior content is treated, the full trajectory it takes over time, rather than about what ends up happening after the tiling regardless of prior content.
The value of respect for autonomy doesn’t ask for the values of others to converge, and doesn’t need to agree with them to be an ally. So that’s an example of a good thing in a sense that isn’t fragile.
This is true; value alignment is quite possible. But if it’s both harder/less safe, and people would rather align their godling with their own values/commands, I think we should either expect this or make very strong arguments against it.
Respect for autonomy is not quite value alignment, just as corrigibility is not quite alignment. I’m pointing out that it might be possible to get a good outcome out of strong optimization without value alignment, because strong optimization can be sensitive to context of the past and so doesn’t naturally result in a past-insensitive tiling of the universe according to its values. Mostly it’s a thought experiment investigating some intuitions about what strong optimization has to be like, and thus importance and difficulty of targeting it precisely at particular values.
Not being a likely outcome is a separate issue; for example, I don’t expect intent alignment in its undifferentiated form to remain secure enough to contain AI-originating agency. To the extent intent alignment grants arbitrary wishes, what I describe is an ingredient of a possible wish, one that’s distinct from value alignment and sidesteps the question of “alignment to whom” in a way different from both CEV and corrigibility. It’s not more clearly specified than CEV either, but it’s distinct from it.
In your use of respect for autonomy as a goal: are you referring to something like Empowerment is (almost) All We Need? I do find that to be an appealing alignment target (I think I’m using alignment slightly more broadly, as in Hubinger’s definition). (I have a post in progress on the terminology of different alignment/goal targets and the resulting confusions.)
The problem with empowerment as an ASI goal is, once again: empowering whom? And do you empower them to create more beings like themselves, whom you then have to empower too? Roger Dearnaley notes that if we empower everyone, humans will probably lose out either to something with less volition but using fewer resources, like insects, or to something with more volition to empower, like other ASIs. Do we really want to limit the future to baseline humans? And how do we handle humans that want to create tons more humans?
See 4. A Moral Case for Evolved-Sapience-Chauvinism and 5. Moral Value for Sentient Animals? Alas, Not Yet from Roger’s AI, Alignment, and Ethics sequence.
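For concreteness, here’s a toy sketch (my own illustration, not anything from that post) of the formal notion of empowerment usually meant in this literature: roughly, the channel capacity from an agent’s actions to its future states, which in a deterministic toy gridworld reduces to the log of the number of distinct states reachable within a horizon.

```python
import math
from itertools import product

# Toy deterministic gridworld (my own construction). With deterministic
# dynamics, n-step empowerment reduces to log2 of the number of distinct
# states reachable by some n-step action sequence.
SIZE = 5
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action):
    x, y = pos
    dx, dy = ACTIONS[action]
    # Moves that would leave the grid just bump into the wall.
    return (min(max(x + dx, 0), SIZE - 1), min(max(y + dy, 0), SIZE - 1))

def empowerment(pos, horizon):
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        p = pos
        for a in seq:
            p = step(p, a)
        reachable.add(p)
    return math.log2(len(reachable))

print(empowerment((0, 0), 2))  # corner cell: fewer reachable states
print(empowerment((2, 2), 2))  # center cell: more reachable states, higher empowerment
```

Even in this toy form, the question of whose reachable states get counted is doing all the work, which is the “empowering whom” problem above.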
I actually do expect intent alignment to remain secure enough to contain AI-originating agency, as long as it’s the primary goal or “singular target”. It’s counterintuitive that a superintelligent being could want nothing more than to do what its principal wants it to do, but I think it’s coherent. And the more competent it gets, the better it will be at doing what you want and nothing more. Before it’s that competent, the principal can give more careful instructions, including instructions to check before acting, and to help with its alignment in various ways.
I agree that respect for autonomy/empowerment is one instruction/intent you could give. I do expect that someone will turn their intent-aligned AGI into an autonomous AGI at some point; hopefully after they’re quite confident in its alignment and the worth of that goal.
Respect for autonomy is not quite empowerment; it’s more like being left alone. The use of this concept is more in defining what it means for an agent or a civilization to develop relatively undisturbed, without getting overwritten by external influence, not in considering ways of helping it develop. So it’s also a building block for defining extrapolated volition, because that involves an extended period of not getting destroyed by external influences. But it’s conceptually prior to extrapolated volition: it doesn’t depend on already knowing what that is, and it’s a simpler notion.
It’s not by itself a good singular target to set an AI to pursue; for example, it doesn’t protect humans from building more extinction-worthy AIs within their membranes, and it doesn’t facilitate any sort of empowerment. But it seems simple enough and agreeable enough as a universal norm to be a plausible aspect of many naturally developing AI goals, and it doesn’t require absence of interaction, so it allows empowerment etc. if that is also something others provide.
Yeah, I agree with your first paragraph. But I think it’s a difference of degree rather than kind. “Do the right thing” is still communication; it’s just communication about something indirect that we nonetheless should be picky about.
I considered titling a different version of this post “we need to also solve the human alignment problem” or something similar.
Perhaps seemingly obvious, but given some of the reactions around Apple putting “Do not hallucinate” into the system prompt of its AI …
If you do get an instruction-following AI to which you can simply give the instruction “Do the right thing”, and it will just do the right thing:
Remember to give the instruction.
You have to specify the right thing for whom. And the AGI won’t know what it is for sure, in a realistic slow takeoff during the critical risk period. See my reply to Charlie above.
But yes, using the AGI’s intelligence to help you issue good instructions is definitely a good idea. See my post Instruction-following AGI is easier and more likely than value aligned AGI for more logic on why.
All non-omniscient agents make decisions with incomplete information. I don’t think this will change at any level of takeoff.
Sure, but my point here is that AGI will be only weakly superhuman during the critical risk period, so it will be highly uncertain, and human judgment is likely to continue to play a large role. Quite possibly to our detriment.