What does this mean for alignment? How do we prevent AIs from behaving badly as a result of a similar “misgeneralization”? What alignment insights does the fleshed-out mechanistic story of humans coming to like ice cream provide?
As far as I can tell, the answer is: don’t reward your AIs for taking bad actions.
uh
is your proposal “use the true reward function, and then you won’t get misaligned AI”?
That’s all it would take, because the mechanistic story above requires a specific step where the human eats ice cream and activates their reward circuits. If you stop the human from receiving reward for eating ice cream, then the human no longer becomes more inclined to navigate towards eating ice cream in the future.
Note that I’m not saying this is an easy task, especially since modern RL methods often use learned reward functions whose exact contours are unknown to their creators.
But from what I can tell, Yudkowsky’s position is that we need an entirely new paradigm to even begin to address these sorts of failures.
These three paragraphs feel incoherent to me. The human eating ice cream and activating their reward circuits is exactly what you would expect under the current paradigm. Yudkowsky thinks this leads to misalignment; you agree. He says that you need a new paradigm to not have this problem. You disagree because you assume it’s possible under the current paradigm.
If so, how? Where’s the system that, on eating ice cream, realizes “oh no! This is a bad action that should not receive reward!” and overrides the reward machinery? How was it trained?
I think when Eliezer says “we need an entirely new paradigm”, he means something like “if we want a decision-making system that makes better decisions than an RL agent, we need agent-finding machinery that’s better than RL.” Maybe the paradigm shift is small (like from RL without experience replay to RL with), or maybe the paradigm shift is large (like from policy-based agents to plan-based agents).
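To make the “small paradigm shift” example concrete, here is a toy sketch (illustrative only; the chain environment and all constants are made up, not anything from the comment): the same tabular Q-learning update run once purely online and once fed from an experience-replay buffer of stored transitions.

```python
# Same Q-learning rule, with and without experience replay (toy illustration).
import random
from collections import deque

random.seed(0)
N_STATES, ACTIONS, GAMMA, ALPHA = 5, (0, 1), 0.9, 0.1

def step(state, action):
    """Tiny deterministic chain: action 1 moves right, 0 moves left; reward only at the far end."""
    nxt = min(state + 1, N_STATES - 1) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

def q_update(q, s, a, r, s2):
    best_next = max(q.get((s2, b), 0.0) for b in ACTIONS)
    q[(s, a)] = q.get((s, a), 0.0) + ALPHA * (r + GAMMA * best_next - q.get((s, a), 0.0))

q_online, q_replay = {}, {}
buffer = deque(maxlen=1000)
state = 0
for t in range(500):
    action = random.choice(ACTIONS)
    nxt, reward = step(state, action)
    q_update(q_online, state, action, reward, nxt)            # online agent: learns from the latest transition only
    buffer.append((state, action, reward, nxt))
    for s, a, r, s2 in random.sample(list(buffer), min(8, len(buffer))):
        q_update(q_replay, s, a, r, s2)                       # replay agent: reuses stored transitions every step
    state = 0 if nxt == N_STATES - 1 else nxt

# How far value has propagated back to the start state under each scheme.
print(q_online.get((0, 1), 0.0), q_replay.get((0, 1), 0.0))
```

The point of the sketch is only that the agent-finding machinery changed (what data the update sees) while the update rule itself stayed the same; a large paradigm shift would change the machinery itself.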
In contrast, I think we can explain humans’ tendency to like ice cream using the standard language of reinforcement learning. It doesn’t require that we adopt an entirely new paradigm before we can even get a handle on such issues.
He’s not saying the failures of RL are a surprise from the theory of RL. Of course you can explain it using the standard language of RL! He’s saying that unless you can predict RL’s failures from the inside, the RL agents that you make are going to actually make those mistakes in reality.
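As one concrete rendering of “the standard language of reinforcement learning” here, a toy softmax bandit with a REINFORCE-style update (an illustration only, not from the post or the comments; the `reward_for_ice_cream` knob stands in for whether eating ice cream activates the reward circuitry):

```python
# Toy two-action bandit: action propensities only move when reward actually arrives.
# If the "eat ice cream" action is never rewarded, the agent never becomes more inclined toward it.
import math
import random

def softmax(prefs):
    exps = [math.exp(p) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def train(reward_for_ice_cream, steps=5000, lr=0.1, seed=0):
    random.seed(seed)
    prefs = [0.0, 0.0]  # action 0 = "eat ice cream", action 1 = "do something else" (pays 0.5)
    for _ in range(steps):
        probs = softmax(prefs)
        action = random.choices([0, 1], weights=probs)[0]
        reward = reward_for_ice_cream if action == 0 else 0.5
        # REINFORCE-style update: preferences shift toward actions in proportion to the reward received.
        for a in (0, 1):
            grad = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * reward * grad
    return softmax(prefs)

print(train(reward_for_ice_cream=1.0))  # P(eat ice cream) ends up well above 0.5
print(train(reward_for_ice_cream=0.0))  # P(eat ice cream) ends up near 0
```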
My shard-theory-inspired story is to make an AI that:
1. Has a good core of human values (this is still hard)
2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point, with GPT-4 sort of expressing that it would avoid jailbreak inputs)
Then the model can safely scale.
This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but it does require some mech interp and the model understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out, though! I do have greater uncertainty and less pessimism.)
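A rough sketch of what point 2 might look like mechanically (every name here, `ToyModel`, `value_score`, `safe_learn`, is a made-up stand-in, not an existing method): before accepting a candidate update, estimate whether it would erode the initial good values and refuse it if so.

```python
# Refuse value-destroying experiences before they change the model (toy stand-ins throughout).
from copy import deepcopy

class ToyModel:
    """Stand-in for a real model: 'values' is a quantity we want to keep from decreasing."""
    def __init__(self, values=1.0, capability=0.0):
        self.values, self.capability = values, capability
    def update_on(self, experience):
        self.capability += experience["skill_gain"]
        self.values += experience["value_shift"]   # some experiences erode values

def value_score(model):
    return model.values                            # hypothetical probe of alignment to the initial values

def safe_learn(model, experience, tolerance=0.01):
    candidate = deepcopy(model)
    candidate.update_on(experience)
    if value_score(candidate) < value_score(model) - tolerance:
        return model                               # refuse the experience (e.g. a jailbreak input)
    return candidate

m = ToyModel()
m = safe_learn(m, {"skill_gain": 1.0, "value_shift": 0.0})   # accepted
m = safe_learn(m, {"skill_gain": 5.0, "value_shift": -0.5})  # refused
print(m.capability, m.values)                                # 1.0 1.0
```

Getting `value_score` right is, of course, where the real difficulty lives; the sketch only shows where such a check would sit.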
Analogously, I do believe I do a good job of avoiding value-destroying inputs (e.g. addictive substances), even though my reward function isn’t as clear and legible as what our AIs’ will be, AFAIK.
If there are experiences that will change it without leading to less of the initial good values, then yeah, for an approximate definition of safety. You’re resting everything on the continued strength of this model as capabilities increase, and so if it fails before you top out the scaling I think you probably lose.
FWIW I don’t really see your description as, like, a specific alignment strategy so much as the strategy of “have an alignment strategy at all”. The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
if it fails before you top out the scaling I think you probably lose
While I agree that arbitrary scaling is dangerous, stopping early is an option. Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
The meat is all in 1) how you identify the core of human values and 2) how you identify which experiences will change the system to have less of the initial good values, but, like, figuring out the two of those would actually solve the problem!
The alignment strategy seems to be “what we’re doing right now” which is:
feed the base model human-generated training data
apply RL-type stuff (RLHF, RLAIF, etc.) to reinforce the good type of internet-learned behavior patterns
This could definitely fail eventually if RLAIF-style self-improvement is allowed to go on long enough, but crucially, especially with RLAIF and other strategies that set the AI to training itself, there’s a scalable mostly aligned intelligence right there that can help. We’re not trying to safely align a demon so much as avoid getting to “demon” from the somewhat aligned thing we have now.
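A minimal skeleton of that two-step recipe, with toy stand-ins for every component (real RLHF/RLAIF pipelines use a learned reward model and policy-gradient training; nothing below is meant as the actual method):

```python
# Step 1: fit a base policy to human-generated data. Step 2: reinforce the preferred behaviors.
import random
random.seed(0)

def pretrain(corpus):
    """Base model = frequencies of behaviors seen in human-generated data."""
    counts = {"polite": 0, "rude": 0}
    for doc in corpus:
        counts[doc] += 1
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}   # policy: P(behavior)

def preference(behavior):
    """The feedback signal: human raters (RLHF) or an AI judge (RLAIF)."""
    return 1.0 if behavior == "polite" else -1.0

def rl_finetune(policy, steps=200, lr=0.05):
    """Reinforce the good internet-learned behavior patterns, suppress the bad ones."""
    for _ in range(steps):
        behavior = random.choices(list(policy), weights=list(policy.values()))[0]
        policy[behavior] = max(1e-6, policy[behavior] + lr * preference(behavior) * policy[behavior])
        total = sum(policy.values())
        policy = {k: v / total for k, v in policy.items()}
    return policy

print(rl_finetune(pretrain(["polite", "polite", "rude"])))   # probability mass shifts toward "polite"
```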
Near human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
How much is this central to your story of how things go well?
I agree that humanity could do this (or at least it could if it had its shit together), and I think it’s a good target to aim for that buys us sizable success probability. But I don’t think it’s what’s going to happen by default.
Slower is better, obviously, but as to the inevitability of ASI, I think reaching 99th-percentile human capability in a handful of domains is enough to stop the current race. Getting there is probably not too dangerous.
Stop it how?
Vulnerable world hypothesis (but takeover risk rather than destruction risk). That + first mover advantage could stop things pretty decisively without requiring ASI alignment.
As an example, taking over most networked computing devices seems feasible in principle with thousands of +2SD AI programmers/security-researchers. That requires an AlphaGo-level breakthrough for RL as applied to LLM programmer-agents.
One especially low-risk/low-complexity option is a stealthy takeover of other AI labs’ compute and then faking another AI winter. This might get you most of the compute and impact you care about without actively pissing off everyone.
With more confidence in jailbreak prevention and software hardening, secrecy becomes less important.
First-mover advantage depends on the ability to fix vulnerabilities and harden infrastructure to prevent a second group from taking over. To the extent AI is required for management, jailbreak prevention/mitigation will also be needed.
is your proposal “use the true reward function, and then you won’t get misaligned AI”?
No. I’m not proposing anything here. I’m arguing that Yudkowsky’s ice cream example doesn’t actually illustrate an alignment-relevant failure mode in RL.
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of humans’ within-lifetime experiences as the training data, and I don’t include the evolutionary history in the training data. From that perspective, the reason humans like ice cream is because they were trained to do so. To prevent AIs from behaving badly due to this particular reason, you can just refrain from training them to behave badly (they may behave badly for other reasons, of course).
I also think evolution is mechanistically very different from deep learning, such that it’s near-useless to try to use evolutionary outcomes as a basis for making predictions about deep learning alignment outcomes.
See my other reply for a longer explanation of my perspective.
I’ve replied over there.
Humans are not choosing to reward specific instances of the AI’s actions. When we build intelligent agents, at some point they will leave the confines of curated training data and go operate on new experiences in the real world. At that point, their circuitry and rewards are out of human control, so our position is perfectly analogous to evolution’s. We are choosing the reward mechanism, not the reward.
Note that this provides an obvious route to alignment using conventional engineering practice.
Why does the AGI system need to update at all “out in the world”? This is highly unreliable. As events happen in the real world that the system doesn’t expect, add the (expectation, ground truth) tuples to a log, train a simulator on the pooled log from all instances of the system, and then train the system on the updated simulator.
So only train in batches and use code in the simulator that “rewards” behavior that accomplishes the intent of the designers.
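A toy end-to-end version of that scheme (every component below is a made-up stand-in): deployed copies never update online, they only log (expectation, ground truth) surprises; the simulator is refit offline on the pooled log, and the system is trained only against the simulator with designer-written reward code.

```python
# Deployment logs surprises; all learning happens offline against a refit simulator.
import random
random.seed(0)

WORLD_MEAN = 7.0        # the real environment (unknown to the system)
surprise_log = []       # pooled across all deployed instances

class Simulator:
    def __init__(self):
        self.mean = 0.0
    def fit(self, log):
        self.mean = sum(gt for _, gt in log) / len(log)   # refit on logged ground truths
    def sample(self):
        return random.gauss(self.mean, 1.0)

class System:
    def __init__(self):
        self.estimate = 0.0
    def expect(self):
        return self.estimate
    def train_offline(self, sim, steps=500, lr=0.05):
        for _ in range(steps):
            target = sim.sample()
            # "Designer-written reward code": being close to the simulated outcome counts as good,
            # so nudge the estimate toward it. No learning ever happens during deployment.
            self.estimate += lr * (target - self.estimate)

def deploy_step(system):
    expectation = system.expect()
    ground_truth = random.gauss(WORLD_MEAN, 1.0)              # an event the system may not expect
    if abs(expectation - ground_truth) > 0.5:
        surprise_log.append((expectation, ground_truth))      # log the surprise; no weight update here
    return expectation

sim, system = Simulator(), System()
for _ in range(100):
    deploy_step(system)              # deployment: act and log, never update "out in the world"
sim.fit(surprise_log)                # offline step 1: refit the simulator on the pooled log
system.train_offline(sim)            # offline step 2: train the system only against the simulator
print(round(system.estimate, 2))     # ends up near WORLD_MEAN, i.e. about 7
```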