20. (...) To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.
So, I’m thinking this is a critique of some proposals to teach an AI ethics by having it be co-trained with humans.
There seem to be many obvious solutions to the problem of there being lots of people who won’t answer correctly to “Point out any squares of people behaving badly” or “Point out any squares of people acting against their self-interest” etc:
- make the AI’s model expect more random errors
- after noticing that some responders give better answers, give their answers more weight (a toy version of this weighting is sketched after the list)
- limit the number of people that will co-train the AI
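To make the second idea a bit more concrete, here is a deliberately toy sketch of what “give their answers more weight” could mean (my own illustrative setup with made-up error rates, not a proposal from the original post): score each rater by how often they agree with the per-item majority vote, then aggregate labels using those reliability scores as weights.

```python
# Toy sketch (illustrative assumptions only): weight raters by estimated reliability.
import numpy as np

rng = np.random.default_rng(0)
n_raters, n_items = 5, 1000
true_labels = rng.integers(0, 2, size=n_items)

# Raters with different error rates; the last rater is nearly random.
error_rates = np.array([0.05, 0.10, 0.15, 0.20, 0.45])
flips = rng.random((n_raters, n_items)) < error_rates[:, None]
labels = np.where(flips, 1 - true_labels, true_labels)      # shape (n_raters, n_items)

# Reliability = agreement with the per-item majority vote.
majority = (labels.mean(axis=0) > 0.5).astype(int)
reliability = (labels == majority).mean(axis=1)

# Reliability-weighted vote over raters.
weighted = (reliability[:, None] * labels).sum(axis=0) / reliability.sum()
weighted_vote = (weighted > 0.5).astype(int)

print("per-rater reliability:", np.round(reliability, 2))
print("plain majority accuracy:   ", (majority == true_labels).mean())
print("weighted majority accuracy:", (weighted_vote == true_labels).mean())
```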
What’s the problem with these ideas?
I work on AI safety via learning from human feedback. In response to your three ideas:
Uniformly random human noise actually isn’t much of a problem. It becomes a problem when the human noise is systematically biased in some way, and the AI doesn’t know exactly what that bias is. Another core problem (which overlaps with the human bias) is that the AI must use a model of human decision-making to back out human values from human feedback/behavior/interaction, etc. If this model is wrong, even slightly (for example, the AI doesn’t realize that the noise is biased along one axis), the AI can infer incorrect human values.
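A deliberately toy way to see the difference (the averaging setup and numbers below are my own illustrative assumptions, not anything from this thread): if the AI aggregates ratings under the assumption that rater noise is zero-mean, uniformly random noise washes out as feedback accumulates, while an unmodeled systematic bias gets inherited directly into the inferred value, no matter how much data you collect.

```python
# Toy sketch (illustrative assumptions only): unbiased noise averages out, biased noise doesn't.
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0        # the "real" human valuation of some outcome
n_labels = 10_000        # lots of human feedback

# Case 1: zero-mean (uniformly random) rating noise.
random_noise = rng.normal(loc=0.0, scale=1.0, size=n_labels)
est_random = np.mean(true_value + random_noise)

# Case 2: systematically biased noise the AI doesn't model
# (e.g. raters consistently over-score flashy-looking behaviour).
bias = 0.5
biased_noise = rng.normal(loc=bias, scale=1.0, size=n_labels)
est_biased = np.mean(true_value + biased_noise)

print(f"unbiased-noise estimate: {est_random:.3f}")   # ~1.0; error shrinks with more data
print(f"biased-noise estimate:   {est_biased:.3f}")   # ~1.5; error persists regardless of data volume
```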
I’m working on it, stay tuned.
Our most capable AI systems require a LOT of training data, and it’s already expensive to generate enough human feedback for training. Limiting the pool of human teachers to trusted experts, or providing pre-training to all of the teachers, would make this even more expensive. One possible way out of this is to train AI systems themselves to give feedback, in imitation of a small trusted set of human teachers.
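As a rough sketch of that last idea (a hypothetical, simplified setup of my own, not a description of any real training pipeline): fit a cheap feedback model on a small trusted-expert-labelled set, then let it supply labels for a much larger pool, so the expensive human effort is amortised.

```python
# Toy sketch (hypothetical setup): distil a small trusted-teacher set into a feedback model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expert_label(x):
    # Stand-in for a trusted human teacher judging "good" vs "bad" behaviour.
    return int(x @ np.array([1.0, -2.0, 0.5]) > 0)

# Small, expensive, trusted-expert-labelled set.
X_trusted = rng.normal(size=(200, 3))
y_trusted = np.array([expert_label(x) for x in X_trusted])

# Feedback model trained to imitate the trusted teachers.
feedback_model = LogisticRegression().fit(X_trusted, y_trusted)

# Large, cheap, unlabelled pool now gets AI-generated feedback.
X_pool = rng.normal(size=(50_000, 3))
y_ai_feedback = feedback_model.predict(X_pool)

print(f"trusted labels used: {len(y_trusted)}")
print(f"AI-generated labels: {len(y_ai_feedback)}")
```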