The problem is that “do the right thing” makes no sense without a reference to what values, or more formally what utility function, the human in question has, so there’s no way to do what you propose even in theory, at least without strong assumptions on their values/utility functions.
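To make the underdetermination concrete, here is one way to put the point in symbols; this is my illustrative gloss, not anything from the original thread.

```latex
% Illustrative only: "do the right thing" picks out an action solely relative
% to the human's utility function U_H over outcomes.
\[
  a^{*} \;=\; \arg\max_{a}\ \mathbb{E}\!\left[\, U_H\big(\mathrm{outcome}(a)\big) \,\right]
\]
% If two candidate utility functions rank outcomes oppositely, say
% U_1(x) > U_1(y) while U_2(x) < U_2(y), then the "right" action flips with
% the choice of U_H, so the instruction determines nothing until U_H (or at
% least strong assumptions about it) is supplied.
```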
Also, it breaks corrigibility, and in many applications like military AI this is a dangerous property to break, because you probably want to be able to change the AI’s orders/actions. This sort of anti-corrigibility is usually bad unless you’re very confident value learning works, a confidence I don’t share.
No language makes sense without a method of interpretation. “Get me some coffee” is a horribly ambiguous instruction that any imagined assistant will have to cope with. How might an AI learn what “get me some coffee” entails without it being hardcoded in?
To say it’s impossible in theory is to set the bar so high that humans using language is also impossible.
As for military use of AGI, I think I’m fine with breaking that application. If we can build AI that does good things when directed to (which can incorporate some parts of corrigibility, like not being overly dogmatic and soliciting a broad swath of human feedback), then we should. If we cannot build AI that actually does good things, we haven’t solved alignment by my lights and building powerful AI is probably bad.
I think the biggest difference I have here is that I don’t think there is that much pressure to converge to a single value, or even to a small space of values, at least in the multi-agent case, unlike in your communication examples. The degrees of freedom for morality are pretty wide/large, unlike in the case of communication, where even simple RL agents can converge on communication/language norms (at least in the non-adversarial case).
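As a concrete illustration of the communication side of that contrast, here is a minimal sketch (my own, not from the thread) of a cooperative Lewis signaling game: two very simple reinforcement learners, rewarded only for coordinating, typically converge on a shared signaling convention that neither had hardcoded.

```python
# Minimal sketch, assuming a cooperative (non-adversarial) Lewis signaling
# game with Roth-Erev style reinforcement; illustrative, not from the thread.
import random

N = 3  # number of world states, signals, and actions

# Propensities start uniform; whatever gets rewarded is reinforced.
sender = [[1.0] * N for _ in range(N)]    # sender[state][signal]
receiver = [[1.0] * N for _ in range(N)]  # receiver[signal][action]

def choose(weights):
    return random.choices(range(N), weights=weights)[0]

for _ in range(20000):
    state = random.randrange(N)
    signal = choose(sender[state])
    action = choose(receiver[signal])
    if action == state:  # both agents are rewarded only when they coordinate
        sender[state][signal] += 1.0
        receiver[signal][action] += 1.0

# Typically each state ends up mapped to a distinct signal and each signal to
# the matching action: a shared convention neither agent had built in.
for state in range(N):
    signal = max(range(N), key=lambda j: sender[state][j])
    action = max(range(N), key=lambda j: receiver[signal][j])
    print(f"state {state} -> signal {signal} -> action {action}")
```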
At a meta level, I’m more skeptical than you seem to be of value learning, especially the ambitious variant of value learning, being a good first target, and I think corrigibility/DWIMAC goals tend to be better than you think they are, primarily because I think the arguments for alignment dooming us have holes that make them not go through.
Strong optimization doesn’t need to ignore boundaries and tile the universe with optimal stuff according to its own aesthetics, disregarding the prior content of the universe (such as other people). The aesthetics can be about how the prior content is treated, the full trajectory it takes over time, rather than about what ends up happening after the tiling regardless of prior content.
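One way to state that distinction in symbols (my gloss, not the commenter’s): an objective over end states versus an objective over whole trajectories.

```latex
% Illustrative: a policy \pi induces a trajectory of states s_0, s_1, ..., s_T.
\[
  \max_{\pi}\ \mathbb{E}\big[\, U(s_T) \,\big]
  \qquad \text{vs.} \qquad
  \max_{\pi}\ \mathbb{E}\big[\, U(s_0, s_1, \dots, s_T) \,\big]
\]
% The trajectory form can care about how the prior content of the universe
% (already present in s_0) is treated along the way, so optimizing it strongly
% need not mean tiling over that content; the pure end-state form cannot
% express such a preference.
```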
The value of respect for autonomy doesn’t ask for the values of others to converge, and doesn’t need to agree with them to be an ally. So that’s an example of a good thing, in a sense that isn’t fragile.
This is true; value alignment is quite possible. But if it’s both harder and less safe, and people would rather align their godling with their own values/commands anyway, I think we should either expect that outcome or make very strong arguments against it.
Respect for autonomy is not quite value alignment, just as corrigibility is not quite alignment. I’m pointing out that it might be possible to get a good outcome out of strong optimization without value alignment, because strong optimization can be sensitive to the context of the past and so doesn’t naturally result in a past-insensitive tiling of the universe according to its values. Mostly it’s a thought experiment investigating some intuitions about what strong optimization has to be like, and thus the importance and difficulty of targeting it precisely at particular values.
Not being a likely outcome is a separate issue; for example, I don’t expect intent alignment in its undifferentiated form to remain secure enough to contain AI-originating agency. To the extent intent alignment grants arbitrary wishes, what I describe is an ingredient of a possible wish, one that’s distinct from value alignment and sidesteps the question of “alignment to whom” in a way different from both CEV and corrigibility. It’s not more clearly specified than CEV either, but it’s distinct from it.
In your use of respect for autonomy as a goal: are you referring to something like Empowerment is (almost) All We Need? I do find that to be an appealing alignment target (I think I’m using alignment slightly more broadly, as in Hubinger’s definition). (I have a post in progress on the terminology of different alignment/goal targets and the resulting confusions.)
The problem with empowerment as an ASI goal is, once again: empowering whom? And do you empower them to make more beings like themselves, whom you then also have to empower? Roger Dearnaley notes that if we empower everyone, humans will probably lose out to either something with less volition but using fewer resources, like insects, or something with more volition to empower, like other ASIs. Do we really want to limit the future to baseline humans? And how do we handle humans that want to create tons more humans?
See 4. A Moral Case for Evolved-Sapience-Chauvinism and 5. Moral Value for Sentient Animals? Alas, Not Yet from Roger’s AI, Alignment, and Ethics sequence.
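For readers who haven’t seen the formal notion behind the empowerment proposal: empowerment is usually defined as the channel capacity from an agent’s action sequences to its future states, i.e. the maximum over p(a) of I(A; S'). Below is a minimal sketch of my own (not from the linked post or sequence), using a toy deterministic world where this reduces to log2 of the number of reachable states.

```python
# Minimal sketch, illustrative only: n-step empowerment in a deterministic
# toy world, where channel capacity reduces to log2(#reachable states).
from itertools import product
from math import log2

def step(pos, action):
    # Toy 1-D world with positions 0..4; actions are move left, stay, move right.
    return max(0, min(4, pos + action))

def empowerment(pos, horizon):
    reachable = set()
    for seq in product((-1, 0, +1), repeat=horizon):
        p = pos
        for a in seq:
            p = step(p, a)
        reachable.add(p)
    return log2(len(reachable))

# An agent in the middle of the world can reach more states than one pinned
# against a wall, so "empowering" it amounts to keeping those options open.
print(empowerment(pos=2, horizon=2))  # log2(5) ≈ 2.32
print(empowerment(pos=0, horizon=2))  # log2(3) ≈ 1.58
```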
I actually do expect intent alignment to remain secure enough to contain AI-originating agency, as long as it’s the primary goal or “singular target”. It’s counterintuitive that a superintelligent being could want nothing more than to do what its principal wants it to do, but I think it’s coherent. And the more competent it gets, the better it will be at doing what you want and nothing more. Before it’s that competent, the principal can give more careful instructions, including instructions to check before acting, and to help with its alignment in various ways.
I agree that respect for autonomy/empowerment is one instruction/intent you could give. I do expect that someone will turn their intent-aligned AGI into an autonomous AGI at some point; hopefully after they’re quite confident in its alignment and the worth of that goal.
Respect for autonomy is not quite empowerment; it’s more like being left alone. The use of this concept is more in defining what it means for an agent or a civilization to develop relatively undisturbed, without getting overwritten by external influence, not in considering ways of helping it develop. So it’s also a building block for defining extrapolated volition, because that involves an extended period of not getting destroyed by external influences. But it’s conceptually prior to extrapolated volition: it doesn’t depend on already knowing what that is, and it’s a simpler notion.
It’s not by itself a good singular target to set an AI to pursue; for example, it doesn’t protect humans from building more extinction-worthy AIs within their membranes, and it doesn’t facilitate any sort of empowerment. But it seems simple enough and agreeable enough as a universal norm to be a plausible aspect of many naturally developing AI goals, and it doesn’t require absence of interaction, so it allows empowerment etc. if that is also something others provide.
Yeah, I agree with your first paragraph. But I think it’s a difference of degree rather than kind. “Do the right thing” is still communication; it’s just communication about something indirect, which we should nonetheless be picky about.
I considered titling a different version of this post “we need to also solve the human alignment problem” or something similar.