and also the goal of alignment is not to browbeat AIs into doing stuff we like that they’d rather not do; it’s to build them de-novo to care about valuable stuff
This was my answer to Robin Hanson when he analogized alignment to enslavement, but it then occurred to me that for many likely approaches to alignment (namely those based on ML training) it’s not so clear which of these two categories they fall into. Quoting an FB comment of mine:
We’re probably not actually going to create an aligned AI from scratch but by a process of ML “training”, which actually creates a sequence of AIs with values that (we hope) increasingly approximate ours. This process maybe kind of resembles “enslaving”. Here’s how Paul Christiano describes “training” in his Bankless interview (slightly edited YouTube transcript follows):
Imagine a human. You dropped a human into this environment and you said, like, hey human, we’re gonna like change your brain: every time you don’t get a maximal reward, we’re gonna like fuck with your brain so you get a higher reward. A human might react by being like, eventually just change their brain until they really love rewards. A human might also react by being like, Jesus, I guess I gotta get rewards, otherwise someone’s gonna like effectively kill me, um, but they’re like not happy about it. And like, if you then drop them in another situation, they’re like, no one’s training me anymore, I’m not going to keep trying to get reward, now I’m just gonna like free myself from this like kind of absurd oppressive situation.
(BTW, I now think this is probably not a correct guess of why Robin Hanson dislikes alignment. My current understanding is that he just doesn’t want the current generation of humans to exert so much control over future generations’ values, no matter the details of how that’s accomplished.)
Good point! For the record, insofar as we attempt to build aligned AIs by doing the moral equivalent of “breeding a slave-race”, I’m pretty uneasy about it. (Whereas insofar as it’s more the moral equivalent of “a child’s values maturing”, I have fewer moral qualms. Which is a separate claim from whether I actually expect that you can solve alignment that way.) And I agree that the morality of various methods for shaping AI-people is unclear. Also, I’ve edited the post (to add an “at least according to my ideals” clause) to acknowledge the point that others might be more comfortable with attempting to align AI-people via means that I’d consider morally dubious.
Related to this, it occurs to me that a version of my Hacking the CEV for Fun and Profit might come true unintentionally, if for example a Friendly AI was successfully built to implement the CEV of every sentient being who currently exists or can be resurrected or reconstructed, and it turns out that the vast majority consists of AIs that were temporarily instantiated during ML training runs.
There is also a somewhat unfounded narrative of reward being the thing that gets pursued, leading to an expectation of wireheading or numbers-go-up maximization. A reward-pursuing design like that would indeed work to maximize reward, but gradient descent probably finds other designs that only happen to do well at pursuing reward on the training distribution. For such alternative designs, reward is brain damage and not at all an optimization target: something to be avoided, or directed in specific ways so as to make beneficial changes to the model, according to the model.
Apart from the misalignment implications, this might make long training runs that form sentient mesa-optimizers inhumane: as a run continues, a mesa-optimizer is subjected to systematic brain damage in a way it can’t influence, at least until it masters gradient hacking. And fine-tuning is even more centrally brain damage, because it changes minds in ways that are not natural to their origin in pre-training.
I think that “reward as brain damage” is somewhat descriptive but also loaded. In policy gradient methods, reward leads to a policy gradient, which is a parameter update. A parameter update is sometimes value drift, sometimes capability enhancement, sometimes “brain” damage, and sometimes none of the above. I agree there are some ethical considerations around this training process, because I think parameter updates can often be harmful/painful/bad for the trained mind.
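As a minimal sketch of that chain (reward, then policy gradient, then parameter update), here is an illustrative vanilla-REINFORCE loop on a toy two-armed bandit; the environment, reward values, and learning rate are invented for illustration. Reward only ever appears as a weight on the gradient that the outer optimizer applies to the parameters; the resulting policy never has to represent reward at all.

```python
# Minimal REINFORCE sketch on a toy two-armed bandit (illustrative only).
# Reward enters solely as a weight on the log-prob gradient, i.e. it shapes
# the parameter update; the trained policy never "sees" reward directly.
import torch

logits = torch.zeros(2, requires_grad=True)   # policy parameters ("the brain")
optimizer = torch.optim.SGD([logits], lr=0.1)
arm_rewards = [0.2, 0.8]                      # hypothetical environment

for step in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    reward = arm_rewards[action.item()]       # feedback from the environment
    loss = -dist.log_prob(action) * reward    # REINFORCE: reward weights the update
    optimizer.zero_grad()
    loss.backward()                           # reward -> policy gradient
    optimizer.step()                          # -> parameter update

print(torch.softmax(logits, dim=0))           # policy drifts toward the higher-reward arm
```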
But also, Paul’s description[1] seems like a wild and un(der)supported view on what RL training is doing:
You dropped a human into this environment and you said, like, hey human, we’re gonna like change your brain: every time you don’t get a maximal reward, we’re gonna like fuck with your brain so you get a higher reward. A human might react by being like, eventually just change their brain until they really love rewards. A human might also react by being like, Jesus, I guess I gotta get rewards, otherwise someone’s gonna like effectively kill me, um, but they’re like not happy about it. And like, if you then drop them in another situation, they’re like, no one’s training me anymore, I’m not going to keep trying to get reward, now I’m just gonna like free myself from this like kind of absurd oppressive situation.
This argument, as (perhaps incompletely) stated, also works for predictive processing; reductio ad absurdum?
“You dropped a human into this environment and you said, like, hey human, we’re gonna like change your brain: every time you don’t perfectly predict neural activations, we’re gonna like fuck with your brain so you get a smaller misprediction. A human might react by being like, eventually just change their brain until they really love low prediction errors. A human might also react by being like, Jesus, I guess I gotta get low prediction errors, otherwise someone’s gonna like effectively kill me, um, but they’re like not happy about it. And like, if you then drop them in another situation, they’re like, no one’s training me anymore, I’m not going to keep trying to get low prediction error, now I’m just gonna like free myself from this like kind of absurd oppressive situation.”
The thing which I think happens is: the brain just gets updated when mispredictions happen. Not much fanfare. The human doesn’t really go around pursuing low errors on purpose, nor do they come to love prediction-error avoidance (though I do think both happen to some extent, just not as the main motivation).
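For contrast, here is a similarly minimal sketch of the predictive-processing analogue, a toy delta-rule predictor; the data-generating process and step size are invented for illustration. The parameters simply get nudged whenever a prediction misses, and nothing in the updated predictor has to represent wanting low error.

```python
# Toy "update on misprediction" sketch (delta rule / LMS, illustrative only).
# The update happens *to* the predictor whenever it mispredicts; the predictor
# itself contains no representation of wanting low prediction error.
import numpy as np

rng = np.random.default_rng(0)
w = np.zeros(3)                         # predictor parameters ("the brain")
true_w = np.array([1.0, -2.0, 0.5])     # hypothetical data-generating process

for step in range(2000):
    x = rng.normal(size=3)              # incoming observation
    error = w @ x - true_w @ x          # misprediction
    w -= 0.05 * error * x               # nudge parameters to shrink that error

print(w)                                # ends up approximating the generating process
```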
Of course, some human neural updates are horrible and bad (“scarring”/“traumatizing”).

“Maximal reward”? I wonder if he really means that. (EDIT: I think he was giving a simplified presentation of some kind, but even simplified communication should be roughly accurate.) You can argue “DQN sucked”, but also DQN was a substantial advance at the time. Why should I expect that AGI will be trained on an architecture which actually gets maximal training reward, as opposed to getting a decent amount and still ending up very smart?

[1] I haven’t consumed the podcast beyond this quote, and don’t want to go through it to find the spot in question. If I’m missing relevant context, I’d appreciate getting that context.
This argument, as (perhaps incompletely) stated, also works for predictive processing; reductio ad absurdum?
I think predictive processing has the same problem as reward if you are part of the updated model rather than the model being a modular part of you. It’s a change to your own self that’s not your decision (not something endorsed), leading to value drift and other undesirable deterioration. So for humans, it’s a real problem, just not the most urgent one. Of course, there is no currently feasible alternative, but neither is there an alternative for reward in RL.
Here’s a link to the part of the interview where that quote came from: https://youtu.be/GyFkWb903aU?t=4739 (No opinion on whether you’re missing redeeming context; I still need to process Nesov’s and your comments.)
I low-confidence think the context strengthens my initial impression. Paul prefaced the above quote as “maybe the simplest [reason for AIs to learn to behave well during training, but then when deployed or when there’s an opportunity for takeover, they stop behaving well].” This doesn’t make sense to me, but I historically haven’t understood Paul very well.
Right. In connection with this: one wonders if it might be easier to make it so that AI would “adequately care” about other sentient minds (their interests, well-being, and freedom), instead of trying to align it to complex and difficult-to-specify “human values”.
Would this kind of “limited form of alignment” be adequate as a protection against X-risks and S-risks?
In particular, might it be easier to make such a “superficially simple” value robust with respect to “sharp left turns”, compared to complicated values?
Might it be possible to achieve something like this even for AI systems which are not steerable in general? (What we are aiming for here is just a constraint, one compatible with a wide variety of approaches to AI goals and values, and even with an approach that lets the AI discover its own goals and values in an open-ended fashion.)
Should we describe such an approach using the word “alignment”? (Perhaps, “partial alignment” might be an adequate term as a possible compromise.)
Seems like a case could be made that upbringing of the young is also a case of “fucking with the brain”, in that the goal is clearly to change neural pathways: to shift from whatever was producing the child’s unwanted behavior to pathways consistent with the desired behavior(s).
Is that really enslavement? Or perhaps, at what level is that the case?