Could we solve alignment by just having an AI learn human preferences from behavior: train a “current best guess” model of human preferences to predict human behavior, update the model until its predictions are accurate, and then use that model as the AI’s reward signal? Is there a danger in relying on these sorts of revealed preferences?
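Concretely, the loop I have in mind looks something like this toy sketch in Python (the action set, the fake “human” data, and the update rule are all hypothetical placeholders, not anyone’s actual proposal):

```python
import numpy as np

# Toy sketch of the loop described above: fit a "best guess" utility over a few
# discrete actions so that it predicts observed human choices, then reuse that
# fitted utility as the AI's reward signal.

rng = np.random.default_rng(0)
ACTIONS = ["help", "wait", "deceive"]

# Pretend "revealed preference" data: actions a human was observed to take.
human_choices = rng.choice(["help", "wait"], size=200, p=[0.8, 0.2])

# Current best-guess model: a utility score per action, predictions via softmax.
utilities = {a: 0.0 for a in ACTIONS}

def predicted_choice_probs(utilities):
    scores = np.array([utilities[a] for a in ACTIONS])
    exp = np.exp(scores - scores.max())
    return dict(zip(ACTIONS, exp / exp.sum()))

# Update the model until its predictions match the observed behavior
# (a simple gradient step on the log-likelihood of each observed choice).
learning_rate = 0.1
for observed in human_choices:
    probs = predicted_choice_probs(utilities)
    for a in ACTIONS:
        grad = (1.0 if a == observed else 0.0) - probs[a]
        utilities[a] += learning_rate * grad

# The fitted preference model then doubles as a reward signal for the AI.
def reward(action):
    return utilities[action]

print({a: round(u, 2) for a, u in utilities.items()})
print("reward('deceive') =", round(reward("deceive"), 2))
```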
On a somewhat related note, someone should answer, “What is this Coherent Extrapolated Volition I’ve been hearing about from the AI safety community? Are there any holes in that plan?”
The main problems with CEV are that there is no known way to compute it in practice, and no proof that it would work even in principle.
I’m lazy, so I’ll just link to this https://www.lesswrong.com/tag/coherent-extrapolated-volition