Yup, very similar. See e.g. https://www.lesswrong.com/posts/btLPgsGzwzDk9DgJG/proper-value-learning-through-indifference
There’s lots of literature out there.
One major difference between my approach and the linked approach is that I think it’s better to apply the counterfactuals to human values, rather than to the AI’s values. Also, I think changing the utilities over time is confusing and likely to lead to bugs; if I had to address the problem of value learning, I would do something along the lines of the following:
Pick some distribution A of utility functions that you wish to be aligned with. Then optimize:
$$U = \mathbb{E}_{v \sim A}\big[\, v \,\big|\, \mathrm{do}(H = v) \,\big]$$
… where H is the model of human preferences. Implementation-wise, this corresponds to simulating the AI in a variety of environments where human preferences are sampled from A, and then judging the AI in each environment by how well it did according to the preferences sampled in that environment.
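For concreteness, here is a minimal Monte Carlo sketch of that loop. Everything in it (`sample_utility`, `rollout`, the toy environment, the fixed episode length) is a hypothetical placeholder chosen for illustration, not something specified above; it is a sketch of the idea, not an implementation of it.

```python
import random
from typing import Callable, List

# Placeholder types: a trajectory is just a list of action strings, and a
# utility function v maps a trajectory to a score.
Trajectory = List[str]
Utility = Callable[[Trajectory], float]


def sample_utility() -> Utility:
    """Stand-in sampler for the distribution A over utility functions."""
    weight = random.random()
    return lambda traj: weight * len(traj)


def rollout(policy: Callable[[str], str], v: Utility) -> Trajectory:
    """Toy environment in which the simulated human's behaviour is driven by v
    (the do(H = v) intervention); the AI only ever sees that behaviour."""
    traj: Trajectory = []
    observation = "episode-start"
    for _ in range(10):  # fixed-length episode, for simplicity
        action = policy(observation)
        traj.append(action)
        # The human's reaction depends on v, so preferences are (in principle)
        # inferable from behaviour but never handed to the policy directly.
        observation = f"human-approval:{v(traj):.2f}"
    return traj


def estimate_U(policy: Callable[[str], str], n_samples: int = 1000) -> float:
    """Monte Carlo estimate of U = E_{v ~ A}[ v | do(H = v) ]."""
    total = 0.0
    for _ in range(n_samples):
        v = sample_utility()        # sample human preferences v ~ A
        traj = rollout(policy, v)   # environment where the human acts on v
        total += v(traj)            # judge the AI by that same v
    return total / n_samples


# Example: evaluate a trivial policy that ignores its observations.
if __name__ == "__main__":
    print(estimate_U(lambda obs: "noop", n_samples=100))
```

The point of the sketch is just that the sampled v enters twice: once as the intervention on the simulated human, and once as the scoring rule, so the policy is rewarded precisely for inferring and serving whatever preferences it happens to find itself among.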
This would induce the AI to be sensitive to human preferences, since in order to succeed, its policy has to infer human preferences from behavior and adjust accordingly. However, I have a hard time seeing this working in practice, because it's very dubious that human values (insofar as they exist, which is also questionable) can be inferred from behavior. I'm much more comfortable applying the counterfactual to simple concrete behaviors than to big abstract behaviors, as the former seems more predictable.