Is your proposal “use the true reward function, and then you won’t get misaligned AI”?
No. I’m not proposing anything here. I’m arguing that Yudkowsky’s ice cream example doesn’t actually illustrate an alignment-relevant failure mode in RL.
I think we have different perspectives on what counts as “training” in the case of human evolution. I think of humans’ within-lifetime experiences as the training data, and I don’t include the evolutionary history in the training data. From that perspective, humans like ice cream because they were trained to like it. To prevent AIs from behaving badly for this particular reason, you can just refrain from training them to behave badly (they may behave badly for other reasons, of course).
I also think evolution is mechanistically very different from deep learning, such that it’s near-useless to try to use evolutionary outcomes as a basis for making predictions about deep learning alignment outcomes.
See my other reply for a longer explanation of my perspective.
I’ve replied over there.
Humans are not choosing to reward specific instances of the AI’s actions: when we build intelligent agents, at some point they will leave the confines of curated training data and operate on new experiences in the real world. At that point, their circuitry and rewards are out of human control, which makes our position perfectly analogous to evolution’s. We are choosing the reward mechanism, not the reward.
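A toy sketch of that distinction in Python, purely for illustration (the reward function, agent, and state stream are hypothetical stand-ins, not anyone’s actual training setup):

```python
import random

# The designers write the reward *mechanism* once, ahead of time.
# (Toy criterion: reward even-numbered actions.)
def reward_fn(state: int, action: int) -> float:
    return 1.0 if action % 2 == 0 else 0.0

class Agent:
    """Placeholder learner; stands in for whatever policy/update rule is used."""
    def act(self, state: int) -> int:
        return random.randint(0, 9)
    def update(self, state: int, action: int, reward: float) -> None:
        pass  # gradient step, value update, etc.

agent = Agent()

# Once deployed, the agent encounters states nobody curated or anticipated.
# No human chooses the reward for any particular instance; the fixed
# mechanism assigns it automatically, outside per-instance human control.
for state in (random.randint(0, 100) for _ in range(1000)):
    action = agent.act(state)
    agent.update(state, action, reward_fn(state, action))
```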
Note that this provides an obvious route to alignment using conventional engineering practice.
Why does the AGI system need to update at all “out in the world”? Doing so is highly unreliable. As events happen in the real world that the system doesn’t expect, add the (expectation, ground truth) tuples to a log, train a simulator on the pooled logs from all instances of the system, and then train the system on the updated simulator.
So only train in batches, and use code in the simulator that “rewards” behavior that accomplishes the intent of the designers.
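A minimal sketch of that loop, assuming hypothetical `Simulator`, `System`, and `intent_reward` components (none of these names come from the comment above; the learned world model and policy updates are elided):

```python
"""Sketch: no online updates during deployment; surprises are logged, a simulator
is refit on the pooled logs, and the system is retrained only inside that
simulator, where the reward code stays under designer control."""
import random

def intent_reward(action: str) -> float:
    """Designer-written code that 'rewards' behavior matching the intended goal."""
    return 1.0 if action == "do_the_intended_thing" else 0.0

class Simulator:
    def __init__(self):
        self.records: list[tuple[str, str]] = []  # (expectation, ground_truth)

    def fit(self, new_records):
        # In practice: retrain a learned world model on logs pooled from
        # every deployed instance. Here we just accumulate the records.
        self.records.extend(new_records)

    def sample_state(self) -> str:
        return random.choice(self.records)[1] if self.records else "default"

class System:
    def predict(self, obs: str) -> str:
        return "expected_" + obs        # stand-in for the model's expectation
    def act(self, obs: str) -> str:
        return "do_the_intended_thing"  # stand-in policy
    def train(self, batch) -> None:
        pass                            # offline batch update

def deploy(system: System, events, log) -> None:
    """Deployment: the system is frozen; unexpected events only get logged."""
    for obs, ground_truth in events:
        expectation = system.predict(obs)
        if expectation != ground_truth:
            log.append((expectation, ground_truth))

def batch_update(system: System, sim: Simulator, pooled_logs) -> None:
    """Batch cycle: refit the simulator, then retrain the system inside it."""
    sim.fit(pooled_logs)
    batch = [(s, system.act(s), intent_reward(system.act(s)))
             for s in (sim.sample_state() for _ in range(64))]
    system.train(batch)
```

The point of the structure is that the reward-assigning code only ever runs inside the simulator, so the designers retain control over it between batches rather than letting the system update against whatever it encounters online.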