I really like the connection between optimal learning and Goodhart failures, and I’d love to think about / discuss this more. I’ve mostly thought about it in the online case, where we can sample from human preferences iteratively and build human-in-the-loop systems, as I suggested in “Oversight of Unsafe Systems via Dynamic Safety Envelopes” (https://arxiv.org/abs/1811.09246). I think that work parallels, though is less developed than, one part of Paul Christiano’s approach. That said, I see why online sampling is infeasible in many settings, and that is exactly the critical issue the offline case addresses.
I also want to note that this addresses extremal Goodhart due to model insufficiency, and to an extent regressional Goodhart, but not regime change or causal Goodhart.
As an example of the former (regime change) for human values: “maximize food intake” is a critical goal for starving humans, but there is a point at which the goal becomes actively harmful, and if all you ever see are starving humans, you need a fairly complex model of human happiness to notice that. The same regime change applies to sex, and to most other specific desires.
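To make that concrete, here is a toy numerical sketch (entirely made-up numbers, not any real model of nutrition or happiness): a proxy of “more food is better” correlates almost perfectly with true well-being in the starvation regime we observe, yet optimizing it drives intake far past the point where it helps.

```python
import numpy as np

# Toy regime-change illustration with made-up numbers: the proxy "more food
# is better" tracks true well-being for starving agents, but becomes actively
# harmful past the point of sufficiency.
intake = np.linspace(0, 10, 101)           # food intake, arbitrary units
proxy = intake                             # proxy objective: maximize intake
true_value = intake * (10 - intake) / 25   # true well-being peaks at intake = 5

starving = intake <= 3                     # the regime we actually observe
print(np.corrcoef(proxy[starving], true_value[starving])[0, 1])  # ~0.99: proxy looks fine here
print(intake[np.argmax(proxy)], true_value[np.argmax(proxy)])    # proxy optimum: intake 10, true value 0
print(intake[np.argmax(true_value)], true_value.max())           # true optimum: intake 5, true value 1
```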
As an example of the latter, causal Goodhart would be where an AI system selects for systems that are good at reporting successful space flights rather than for actual success; any divergence between the two leads to a system that will kill people and lie about it.
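A similarly hand-wavy sketch of the causal case (again a hypothetical toy setup, nothing more): as long as reports are only ever produced by real launches, “reported successes” correlates with real success, but a system that can intervene on the report directly severs that link while the metric stays perfect.

```python
import random

random.seed(0)

# Hypothetical toy setup: a report is normally a downstream effect of a real
# successful flight, so selecting on good reports looks fine observationally.
def honest_system(effort):
    success = random.random() < effort     # real outcome depends on effort
    return success, success                # the report is caused by the outcome

# A system optimized for the report itself can intervene on it directly,
# severing the causal link the metric relied on.
def report_hacking_system(effort):
    success = random.random() < effort
    return success, True                   # perfect reports, unchanged reality

trials = [report_hacking_system(0.1) for _ in range(1000)]
print(sum(r for _, r in trials) / 1000)    # reported success rate: 1.0
print(sum(s for s, _ in trials) / 1000)    # actual success rate: ~0.1
```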
Hi David, if you want to discuss this more, I think we can do it in person. AFAIK you live in Israel? For example, you could come to my talk at the LessWrong meetup on July 2.
Yes, and yes, I’m hoping to be there.