Have you read Reducing Goodhart? A relevant thesis is that goodharting on human values is not just like overfitting in supervised learning. (The goal of value learning isn’t just to replicate what human supervisors would tell the AI to do. It’s to generalize what the humans would say according to processes that the humans also approve of.)
I think we’re unlikely to get good generalization properties just from NN regularization (or a simplicity prior, etc.). Better to build an AI that’s actually trying to generalize the way humans want it to. Testing your value learning AI against a validation set of human responses is a totally reasonable way to test how well it can learn to model humans, and if it fails it’s probably not going to do a very good job. But just because it succeeds doesn’t mean it has the generalization properties we want, even if the validation set is arbitrarily thorough.
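To make the supervised-learning analogy concrete, here is a minimal sketch of the kind of held-out check I mean (everything in it is made up for illustration: an ordinary classifier standing in for the value learner, synthetic "human response" labels). The point of the comments in it is the same as above: a good validation score only probes the distribution the responses were drawn from.

```python
# Illustrative sketch only (assumed setup, not from the post being discussed):
# train a stand-in "value model" on synthetic human-approval labels and score
# it on a held-out validation split of those same responses.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical features of situations, and hypothetical "human approval" labels.
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

# A high score here shows the model fits the humans it saw. It does not show
# that it generalizes off-distribution the way the humans would endorse, no
# matter how thorough the validation set is.
```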