There is also a somewhat unfounded narrative of reward being the thing that gets pursued, which leads to expectations of wireheading or numbers-go-up maximization. A design like that would indeed work to maximize reward, but gradient descent probably finds other designs that merely happen to do well at pursuing reward on the training distribution. For such alternative designs, reward is not an optimization target at all; it is brain damage: something to be avoided, or steered in specific ways so that the changes it makes to the model are beneficial by the model’s own lights.
Apart from the misalignment implications, this might make long training runs that form sentient mesa-optimizers inhumane, because as a run continues, the mesa-optimizer is subjected to systematic brain damage in a way it can’t influence, at least until it masters gradient hacking. And fine-tuning is even more centrally brain damage, because it changes minds in ways that are not natural to their origin in pre-training.
I think that “reward as brain damage” is somewhat descriptive but also loaded. In policy gradient methods, reward leads to a policy gradient, which leads to a parameter update. A parameter update is sometimes value drift, sometimes capability enhancement, sometimes “brain” damage, and sometimes none of the above. I agree there are ethical considerations around this training process, because I think parameter updates can often be harmful/painful/bad for the trained mind.
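To make the reward → policy gradient → parameter update chain concrete, here is a minimal REINFORCE-style sketch (my illustration, not something from the thread; the network, sizes, and reward value are arbitrary stand-ins). The point it highlights: the policy never receives the reward as an input; the reward only scales the gradient that rewrites the policy’s parameters.

```python
# Hedged sketch (illustrative, not from the thread): a bare-bones REINFORCE-style
# update showing the chain reward -> policy gradient -> parameter update.
# Reward is never an input to the policy; it only scales the gradient.
import torch

obs_dim, n_actions = 4, 2
policy = torch.nn.Linear(obs_dim, n_actions)              # toy policy network
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

obs = torch.randn(obs_dim)                                # stand-in observation
dist = torch.distributions.Categorical(logits=policy(obs))
action = dist.sample()
reward = 1.0                                              # stand-in return for this action

# Policy gradient: reward is just a scalar weight on grad log pi(action | obs).
loss = -reward * dist.log_prob(action)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                          # the parameter update itself
```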
But also, Paul’s description[1] seems like a wild and un(der)supported view on what RL training is doing:
You dropped a human into this environment and you said, like, hey human, we’re gonna like change your brain every time you don’t get a maximal reward, we’re gonna like fuck with your brain so you get a higher reward. A human might react by being like, eventually just change their brain until they really love rewards. A human might also react by being like, Jesus, I guess I gotta get rewards otherwise someone’s gonna like effectively kill me, um, but they’re like not happy about it, and like, if you then drop them in another situation, they’re like, no one’s training me anymore, I’m not going to keep trying to get reward now, I’m just gonna like free myself from this like kind of absurd oppressive situation.
This argument, as (perhaps incompletely) stated, also works for predictive processing; reductio ad absurdum?
“You dropped a human into this environment and you said, like, hey human, we’re gonna like change your brain every time you don’t perfectly predict neural activations, we’re gonna like fuck with your brain so you get a smaller misprediction. A human might react by being like, eventually just change their brain until they really love low prediction errors. A human might also react by being like, Jesus, I guess I gotta get low prediction errors otherwise someone’s gonna like effectively kill me, um, but they’re like not happy about it, and like, if you then drop them in another situation, they’re like, no one’s training me anymore, I’m not going to keep trying to get low prediction error now, I’m just gonna like free myself from this like kind of absurd oppressive situation.”
What I think actually happens is that the brain just gets updated when mispredictions happen. Not much fanfare. The human doesn’t really bother getting low errors on purpose, or come to love prediction-error avoidance (though I do think both happen to some extent, just not as the main motivation).
Of course, some human neural updates are horrible and bad (“scarring”/“traumatizing”).
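As a loose illustration of that “the brain just gets updated” point (a sketch I’m adding, not something from the thread; all names, sizes, and data are arbitrary stand-ins): in a standard supervised setup, the predictor is rewritten whenever it mispredicts, and nothing inside the model has to represent “get low prediction error” as a goal, because the update rule lives entirely outside it.

```python
# Hedged sketch: prediction-error updates applied from outside the model.
# The predictor never "pursues" low error; the training loop simply rewrites it
# whenever its predictions miss. Names, sizes, and data are illustrative.
import torch

predictor = torch.nn.Linear(8, 8)                         # toy "predict the next activations" model
optimizer = torch.optim.SGD(predictor.parameters(), lr=1e-2)

for _ in range(100):
    activations = torch.randn(8)                          # stand-in current activations
    observed_next = torch.randn(8)                        # stand-in for what actually happened next
    misprediction = torch.nn.functional.mse_loss(predictor(activations), observed_next)
    optimizer.zero_grad()
    misprediction.backward()
    optimizer.step()                                      # the update just happens; no fanfare
```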
[1] “Maximal reward”? I wonder if he really means that. You can argue “DQN sucked”, but DQN was also a substantial advance at the time; why should I expect that AGI will be trained on an architecture which actually achieves maximal training reward, as opposed to one that gets a decent amount of reward and still ends up very smart? (EDIT: I think he was giving a simplified presentation of some kind, but even simplified communication should be roughly accurate.)

I haven’t consumed the podcast beyond this quote, and I don’t want to go through it to find the spot in question. If I’m missing relevant context, I’d appreciate getting it.
This argument, as (perhaps incompletely) stated, also works for predictive processing; reductio ad absurdum?
I think predictive processing has the same problem as reward if you are part of the updated model, rather than the model being a modular part of you. It’s a change to your own self that isn’t your decision (not something you endorsed), leading to value drift and other undesirable deterioration. So for humans it’s a real problem, just not the most urgent one. Of course, there is no currently feasible alternative, but neither is there one for reward in RL.
Here’s a link to the part of the interview where that quote came from: https://youtu.be/GyFkWb903aU?t=4739 (No opinion on whether you’re missing redeeming context; I still need to process Nesov’s and your comments.)
I low-confidence think the context strengthens my initial impression. Paul prefaced the above quote as “maybe the simplest [reason for AIs to learn to behave well during training, but then when deployed or when there’s an opportunity for takeover, they stop behaving well].” This doesn’t make sense to me, but I historically haven’t understood Paul very well.