I really like the connection between optimal learning and Goodhart failures, and I’d love to think about / discuss this more. I’ve mostly thought about it in the online case, where we can sample from human preferences iteratively and build human-in-the-loop systems, as I suggested in “Oversight of Unsafe Systems via Dynamic Safety Envelopes” (https://arxiv.org/abs/1811.09246). I think that work parallels, though is less developed than, one part of Paul Christiano’s approach. That said, I see why online sampling is infeasible in many settings, and that is exactly the critical issue the offline case addresses.
I also want to note that this addresses extremal Goodhart due to model insufficiency, and to an extent regressional Goodhart, but not regime change or causal Goodhart.
As an example of the former (regime change) for human values: “maximize food intake” is a critical goal for starving humans, but there is a point at which the goal becomes actively harmful, and if all you ever see are starving humans, you need a fairly complex model of human happiness to notice that. The same regime change applies to sex, and to most other specific desires.
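To make that concrete, here is a toy numerical sketch (entirely made-up numbers, not any real model of nutrition or happiness): a proxy of “more food is better” correlates almost perfectly with true well-being in the starvation regime we observe, yet optimizing it drives intake far past the point where it helps.

```python
import numpy as np

# Toy regime-change illustration with made-up numbers: the proxy "more food
# is better" tracks true well-being for starving agents, but becomes actively
# harmful past the point of sufficiency.
intake = np.linspace(0, 10, 101)           # food intake, arbitrary units
proxy = intake                             # proxy objective: maximize intake
true_value = intake * (10 - intake) / 25   # true well-being peaks at intake = 5

starving = intake <= 3                     # the regime we actually observe
print(np.corrcoef(proxy[starving], true_value[starving])[0, 1])  # ~0.99: proxy looks fine here
print(intake[np.argmax(proxy)], true_value[np.argmax(proxy)])    # proxy optimum: intake 10, true value 0
print(intake[np.argmax(true_value)], true_value.max())           # true optimum: intake 5, true value 1
```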
As an example of the latter, causal Goodhart would be where an AI system selects for systems that are good at reporting successful space flights rather than for actual success; any divergence between the two leads to a system that will kill people and lie about it.
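A similarly hand-wavy sketch of the causal case (again a hypothetical toy setup, nothing more): as long as reports are only ever produced by real launches, “reported successes” correlates with real success, but a system that can intervene on the report directly severs that link while the metric stays perfect.

```python
import random

random.seed(0)

# Hypothetical toy setup: a report is normally a downstream effect of a real
# successful flight, so selecting on good reports looks fine observationally.
def honest_system(effort):
    success = random.random() < effort     # real outcome depends on effort
    return success, success                # the report is caused by the outcome

# A system optimized for the report itself can intervene on it directly,
# severing the causal link the metric relied on.
def report_hacking_system(effort):
    success = random.random() < effort
    return success, True                   # perfect reports, unchanged reality

trials = [report_hacking_system(0.1) for _ in range(1000)]
print(sum(r for _, r in trials) / 1000)    # reported success rate: 1.0
print(sum(s for s, _ in trials) / 1000)    # actual success rate: ~0.1
```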
Hi David, if you want to discuss this more, I think we can do it in person. AFAIK you live in Israel? For example, you could come to my talk at the LessWrong meetup on July 2.
Yes, and yes, I’m hoping to be there.