One further issue is that if the AI does this inference within a single human model (as in CIRL), it may follow that model off a metaphorical cliff when trying to maximize the modeled reward.
Merely expanding the family of models isn't enough, because the best-predicting model is something like a microscopic, non-intentional model of the human: a "nearest unblocked model" problem. The solution should be similar in spirit: get the AI to score candidate models so that the sort of model we want it to use scores highly (or something more complicated in the places where human morality is undefined). This isn't just a prior; we want predictive quality to be only one of several (as yet ill-defined) criteria.
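As a rough sketch of what "scoring models on more than predictive quality" could mean (this is my illustration, not a proposal from the original argument), imagine each candidate human model getting a composite score: predictive fit plus other criteria. The `HumanModel` interface, the `intentionality` term, and the weights below are all hypothetical placeholders for criteria we don't yet know how to define.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class HumanModel:
    """Toy stand-in for a candidate model of the human (hypothetical interface)."""
    name: str
    log_likelihood: Callable[[Sequence], float]  # predictive fit to observed behaviour
    intentionality: float                        # placeholder: how agent-like the model treats the human, in [0, 1]


def model_score(model: HumanModel,
                observations: Sequence,
                w_predict: float = 1.0,
                w_intent: float = 5.0) -> float:
    """Score a model by predictive quality PLUS other criteria.

    Predictive fit alone would favour the microscopic, non-intentional model;
    the extra term is meant to keep intentional-stance models competitive.
    The weights and the intentionality term are illustrative only.
    """
    return w_predict * model.log_likelihood(observations) + w_intent * model.intentionality


def select_model(models: List[HumanModel], observations: Sequence) -> HumanModel:
    """Pick the highest-scoring model, not merely the best-predicting one."""
    return max(models, key=lambda m: model_score(m, observations))


# Toy demo: the intentional model wins despite a worse predictive fit.
obs: list = []  # placeholder observations
microscopic = HumanModel("microscopic", log_likelihood=lambda o: -1.0, intentionality=0.0)
intentional = HumanModel("intentional", log_likelihood=lambda o: -3.0, intentionality=0.9)
assert select_model([microscopic, intentional], obs).name == "intentional"
```

Of course, any fixed extra term like this just relocates the nearest-unblocked-model problem: the AI may find a model that games the intentionality criterion while still being effectively non-intentional, which is why the criteria themselves remain the hard, ill-defined part.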