Could we solve alignment by just having an AI learn human preferences from behavior: train a “current best guess” model of human preferences to predict human behavior, update the model until its predictions are accurate, and then use that model as the AI’s reward signal? Is there a danger in relying on these sorts of revealed preferences?
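Concretely, the loop I have in mind looks something like this toy sketch in Python (the action set, the fake “human” data, and the update rule are all hypothetical placeholders, not anyone’s actual proposal):

```python
import numpy as np

# Toy sketch of the loop described above: fit a "best guess" utility over a few
# discrete actions so that it predicts observed human choices, then reuse that
# fitted utility as the AI's reward signal.

rng = np.random.default_rng(0)
ACTIONS = ["help", "wait", "deceive"]

# Pretend "revealed preference" data: actions a human was observed to take.
human_choices = rng.choice(["help", "wait"], size=200, p=[0.8, 0.2])

# Current best-guess model: a utility score per action, predictions via softmax.
utilities = {a: 0.0 for a in ACTIONS}

def predicted_choice_probs(utilities):
    scores = np.array([utilities[a] for a in ACTIONS])
    exp = np.exp(scores - scores.max())
    return dict(zip(ACTIONS, exp / exp.sum()))

# Update the model until its predictions match the observed behavior
# (a simple gradient step on the log-likelihood of each observed choice).
learning_rate = 0.1
for observed in human_choices:
    probs = predicted_choice_probs(utilities)
    for a in ACTIONS:
        grad = (1.0 if a == observed else 0.0) - probs[a]
        utilities[a] += learning_rate * grad

# The fitted preference model then doubles as a reward signal for the AI.
def reward(action):
    return utilities[action]

print({a: round(u, 2) for a, u in utilities.items()})
print("reward('deceive') =", round(reward("deceive"), 2))
```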
On a somewhat related note, someone should answer, “What is this Coherent Extrapolated Volition I’ve been hearing about from the AI safety community? Are there any holes in that plan?”
The main problems with CEV are that there is no known way to compute it in practice, and no proof that it would work even in principle.
I’m lazy, so I’ll just link to this https://www.lesswrong.com/tag/coherent-extrapolated-volition