It’s easy to come up with a crappy proxy feedback signal—just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure. […] the core problems of alignment have to be solved via something more than just feedback.
No. I strongly disagree, assuming you mean “feedback signals” to include “reward signals.” The feedback signal is not the optimization target. The point of the feedback signal is not to be safely maximizable. The point of a feedback signal is to supply cognitive-updates to the network/agent. If the cognitive-updates grow human-aligned cognitive patterns which govern the AI’s behavior, we have built an aligned agent.
For example, suppose that I penalize the agent whenever I catch it lying. Then credit assignment de-emphasizes certain cognitive patterns which produced those outputs, and—if there are exact gradients to alternative actions—emphasizes or fine-tunes new lines of computation which would have produced the alternative actions in that situation. Concretely, I ask the AI whether it hates dogs, and it says “yes”, and then I ask it whether it admitted to hating dogs, and it says “no.”
Perhaps the AI had initially lied due to its pretrained initialization predicting that a human would have lied in that context, but then that reasoning gets penalized by credit assignment when I catch the AI lying. The reinforcement tweaks the AI to be less likely to lie in similar situations. Perhaps it learns “If a human would lie, then be honest.” Perhaps it learns some totally alien other thing. But importantly, the AI is not necessarily optimizing for high reward—the AI is being reconfigured by the reinforcement signals.
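(A minimal, runnable sketch of how such a penalty enters a policy-gradient update, assuming a toy two-answer policy and a simple −1/0 penalty scheme; none of the names or numbers below come from the discussion above. What it illustrates is that the reward appears only as a coefficient on the gradient of the sampled answer's log-probability: it reconfigures the computation that produced the answer, rather than being handed to the agent as a quantity to pursue.)

```python
# Illustrative sketch only: one REINFORCE-style credit-assignment step in which
# "getting caught lying" shows up solely as a scalar multiplying the gradient of
# the log-probability of the sampled answer. The reward is never an input to the
# network and never something the network is asked to predict.
import torch
import torch.nn as nn

# Toy policy (assumed for illustration): maps an 8-dim context to logits over
# two possible answers, e.g. "yes"/"no".
policy = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 2))
optimizer = torch.optim.SGD(policy.parameters(), lr=1e-2)

def credit_assignment_step(context, answer_idx, caught_lying):
    """Update the policy for one question/answer episode."""
    log_probs = torch.log_softmax(policy(context), dim=-1)
    reward = -1.0 if caught_lying else 0.0          # assumed penalty scheme
    # Policy-gradient surrogate loss: minimizing it de-emphasizes the
    # computation that produced the penalized answer in this context.
    loss = -reward * log_probs[answer_idx]
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example episode: the AI answered "no" (index 1) to "did you admit to hating
# dogs?" and the overseer caught the lie.
credit_assignment_step(torch.randn(8), answer_idx=1, caught_lying=True)
```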
I think the key question of alignment is: How do we provide reinforcement signals so as to reliably reinforce and grow certain kinds of cognition within an AI? Asking after feedback signals which don’t “fail horribly under sufficient optimization pressure” misses this more interesting and relevant question.
Straw person: We haven’t found any feedback producer whose outputs are safe to maximise. We strongly suspect there isn’t one.
Ramana’s gloss of TurnTrout: But AIs don’t maximise their feedback. The feedback is just input to the algorithm that shapes the AI’s cognition. This cognition may then go on to in effect “have a world model” and “pursue something” in the real world (as viewed through its world model). But its world model might not even contain the feedback producer, in which case it won’t be pursuing high feedback. (Also, it might just do something else entirely.)
Less straw person: Yeah I get that. But what kind of cognition do you actually get after shaping it with a lot of feedback? (i.e., optimising/selecting the cognition based on its performance at feedback maximisation) If your optimiser worked, then you get something that pursues positive feedback. Spelling things out, what you get will have a world model that includes the feedback producer, and it will pursue real high feedback, as long as doing so is a possible mind configuration and the optimiser can find it, since that will in fact maximise the optimisation objective.
Possible TurnTrout response: We’re obviously not going to be using “argmax” as the optimiser though.
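(To make the disputed step concrete: "using argmax as the optimiser" would mean literally selecting a policy that globally maximises expected feedback, whereas the training procedures under discussion take local stochastic-gradient steps estimated from sampled episodes. In generic notation, not specific to any particular algorithm:)

\[
\pi^{*} \in \arg\max_{\pi} \ \mathbb{E}\!\left[\textstyle\sum_t r_t \,\middle|\, \pi\right]
\qquad\text{versus}\qquad
\theta_{k+1} = \theta_k + \alpha\, \widehat{\nabla_{\theta}\, \mathbb{E}\!\left[\textstyle\sum_t r_t \,\middle|\, \pi_{\theta}\right]}\,\Big|_{\theta=\theta_k},
\]

where the gradient estimate on the right is built only from trajectories sampled from the current policy, so the parameters you end up with depend on the path taken through parameter space, not only on what would maximise feedback.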
Thanks for running a model of me :)
Actual TurnTrout response: No.
Addendum: I think that this reasoning fails on the single example we have of general intelligence (i.e. human beings). People probably do value “positive feedback” (in terms of reward prediction error or some tight correlate thereof), but people are not generally reward optimizers.
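(For readers who want the technical referent of "reward prediction error": in the standard temporal-difference formulation it is the quantity below, which drives updates to a learned value estimate; whether human reward circuitry tracks exactly this or only a "tight correlate" is deliberately left open above.)

\[
\delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha\, \delta_t .
\]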
I think perhaps a lot of work is being done by “if your optimiser worked”. This might also be where there’s a disanalogy between humans<->evolution and AIs<->SGD+PPO (or whatever RL algorithm you’re using to optimise the policy). Maybe evolution is actually a very weak optimiser that doesn’t really “work”, compared to SGD+RL.
I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen.
I think the way I’d fit that into my ontology is “the reward signal is not the relevant feedback signal (for purposes of this argument)”. The relevant feedback signal is whatever some human looks at, at the end of the day, to notice when there are problems or to tell how well the AI is doing by the human’s standards. It’s how we (human designers/operators) notice the problems on which to iterate. It’s whatever the designer is implicitly optimizing for, in the long run, by developing an AI via the particular process the designer is using.
If the human is just using the reward signal as a control interface for steering the AI’s internals, then the reward signal is not the feedback signal to which this argument applies.
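(A toy, runnable sketch of that distinction, with every name and number an illustrative assumption rather than anyone's actual proposal: the inner loop consumes the reward signal as a control interface on the agent's internals, while the outer loop is where the designer inspects behavior, notices problems, and revises the process; that designer-level evaluation is the feedback signal the argument is about.)

```python
def train_agent(proxy_target, steps=500, lr=0.1):
    """Inner loop: the reward signal only steers updates to the agent's internals."""
    weight = 0.0                              # stand-in for the agent's parameters
    for _ in range(steps):
        # Gradient ascent on the proxy reward  r(w) = -(w - proxy_target)**2 .
        weight += lr * (-2.0) * (weight - proxy_target)
    return weight

def develop_ai(true_target=1.0, max_rounds=10):
    """Outer loop: the designer's implicit target is whatever they inspect here."""
    proxy_target = 0.3                        # initial, crappy proxy reward scheme
    agent = train_agent(proxy_target)
    for _ in range(max_rounds):
        gap = true_target - agent             # what the human actually looks at
        if abs(gap) < 0.05:                   # "seems fine" by the human's standards
            break
        proxy_target += gap                   # tweak the reward scheme...
        agent = train_agent(proxy_target)     # ...and train again
    return agent

develop_ai()
```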
We discussed more in person. I ended up agreeing with (what I perceive to be) a substantially different claim than I read from your original comment. I agree that we can’t just figure out alignment by black-boxing AI cognition and seeing whether the AI does good things or not, nor can we just set up feedback loops on that (e.g. train a succession of agents and tweak the process based on how aligned they seem) without some substantial theoretical underpinnings with which to interpret the evidence.
However, I still don’t see how your original comment is a reasonable way to communicate this state of mind. For example, you wrote:
It’s easy to come up with a crappy proxy feedback signal—just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.
What does this mean, if not using human approval as a reward signal? Can you briefly step me through a fictional scenario where the described failure obtains?
Hm.
Now I don’t understand why this will obviously fail horribly, if your argument doesn’t apply to reward signals. How does human approval fail horribly when used in RL training?