Glad you are thinking along these lines. Personally, I would go even further and use existing ML concepts in the implementation of this idea. Instead of explicitly stating W as our current best estimate for U, provide the system with a labeled dataset about human preferences, using soft labels (probabilities that aren’t 0 or 1) rather than hard labels, to better communicate our uncertainty. Have the system use active learning to identify examples whose labels would be highly informative for its model. Use cross-validation to figure out which modeling strategies generalize most effectively with calibrated probability estimates. I’m pretty sure there are also machine learning techniques for identifying examples that have a high probability of being mislabeled, or that are especially pivotal to the system’s model of the world; those could be used to surface particular examples for the human overseer to give a second look. (If such techniques don’t exist already, I don’t think they would be hard to develop.)
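Here’s a minimal sketch of how a few of these pieces might fit together, just to make the proposal concrete. Everything in it is illustrative: the toy preference dataset, the logistic-regression model, the entropy-based query rule, and the disagreement-based mislabeling check are placeholder choices, not a claim about the right components.

```python
# Illustrative sketch only: soft labels + active learning + calibration checks
# on a toy "human preference" dataset. All data and model choices are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Toy feature vectors for candidate actions, with soft labels:
# each label is P(humans approve of this), not a hard 0/1 judgment.
X = rng.normal(size=(200, 5))
true_weights = rng.normal(size=5)
soft_labels = 1.0 / (1.0 + np.exp(-(X @ true_weights + rng.normal(scale=0.5, size=200))))

# Train against the soft labels by sampling hard labels in proportion to them,
# so the stated uncertainty shows up as noise in the training signal.
y_sampled = rng.binomial(1, soft_labels)
model = LogisticRegression().fit(X, y_sampled)

# Cross-validation with a proper scoring rule (log loss), to compare how well
# different modeling strategies produce calibrated probability estimates.
cv_log_loss = -cross_val_score(model, X, y_sampled, scoring="neg_log_loss", cv=5)
print("mean CV log loss:", cv_log_loss.mean())

# Active learning: surface the unlabeled examples the model is least sure about,
# since a human label there would be most informative.
X_pool = rng.normal(size=(500, 5))
p = model.predict_proba(X_pool)[:, 1]
entropy = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
query_idx = np.argsort(entropy)[-10:]  # ten most uncertain candidates
print("pool indices to send to the human overseer:", query_idx)

# Crude possibly-mislabeled check (in-sample for brevity; a real version would
# use held-out predictions or influence-style scores): flag labeled examples
# where the model's probability strongly disagrees with the recorded soft label.
disagreement = np.abs(model.predict_proba(X)[:, 1] - soft_labels)
suspect_idx = np.argsort(disagreement)[-10:]
print("labeled examples worth a second look:", suspect_idx)
```

In a real system each piece would be swapped for something far more capable, but the division of labor is the point: the soft labels carry our uncertainty, the acquisition step decides what to ask the overseer, and the proper scoring rule picks among modeling strategies.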
Yep, those could work as well. I’m most worried about human errors/uncertainties around distribution shifts (i.e., we write out a way of dealing with distribution shifts, but don’t correctly include our uncertainty about that writeup).
It’s uncertainty all the way down. This is where recursive self-improvement comes in handy.