Straw person: We haven’t found any feedback producer whose outputs are safe to maximise. We strongly suspect there isn’t one.
Ramana’s gloss of TurnTrout: But AIs don’t maximise their feedback. The feedback is just input to the algorithm that shapes the AI’s cognition. This cognition may then go on to in effect “have a world model” and “pursue something” in the real world (as viewed through its world model). But its world model might not even contain the feedback producer, in which case it won’t be pursuing high feedback. (Also, it might just do something else entirely.)
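The distinction can be made concrete with a minimal policy-gradient sketch (a hypothetical two-armed bandit, plain REINFORCE rather than PPO): the reward signal appears only inside the update rule that shapes the policy's parameters, and is never part of the policy's input, so nothing forces the resulting policy to represent or pursue the feedback producer itself.

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)            # policy parameters: the "cognition" being shaped
rewards = np.array([0.0, 1.0])  # arm 1 yields feedback, arm 0 does not (toy setup)
lr = 0.5

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(200):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    r = rewards[action]
    # REINFORCE update: reward enters only here, as a scalar weight on the
    # log-probability gradient. The policy never observes the reward signal.
    grad_logp = np.eye(2)[action] - probs
    logits += lr * r * grad_logp

print(softmax(logits))  # probability mass concentrates on the rewarded arm
```

The trained policy ends up reliably pulling the rewarded arm, but the mechanism is just "updates pushed its parameters that way"; whether a far more capable system trained this way ends up *modelling and pursuing* the feedback source is exactly the question under dispute.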
Less straw person: Yeah I get that. But what kind of cognition do you actually get after shaping it with a lot of feedback? (i.e., optimising/selecting the cognition based on its performance at feedback maximisation) If your optimiser worked, then you get something that pursues positive feedback. Spelling things out, what you get will have a world model that includes the feedback producer, and it will pursue real high feedback, as long as doing so is a possible mind configuration and the optimiser can find it, since that will in fact maximise the optimisation objective.
Possible TurnTrout response: We’re obviously not going to be using “argmax” as the optimiser though.
Thanks for running a model of me :)
Actual TurnTrout response: No.
Addendum: I think that this reasoning fails on the single example we have of general intelligence (i.e. human beings). People probably do value “positive feedback” (in terms of reward prediction error or some tight correlate thereof), but people are not generally reward optimizers.
I think perhaps a lot of work is being done by “if your optimiser worked”. This might also be where there’s a disanalogy between humans<->evolution and AIs<->SGD+PPO (or whatever RL algorithm you’re using to optimise the policy). Maybe evolution is actually a very weak optimiser that doesn’t really “work”, compared to SGD+RL.
I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen.