Humans don’t have our values written in Fortran on the inside of our skulls; we’re collections of atoms that only do agent-like things within a narrow band of temperatures and pressures. It’s not that there’s some pre-theoretic set of True Values hidden inside people and we’re merely having trouble getting to them. No, extracting any values at all from humans is a theory-laden act of inference, relying on choices like “which atoms exactly count as part of the person” and “what do you do if the person says different things at different times?”
The natural framing of Goodhart’s law, in both mathematics and casual language, assumes that there are some specific True Values in here, some V to compare to U. But this assumption, and the way of thinking built on top of it, is crucially false when you get down to the nitty-gritty of how to model humans and infer their values.
I still endorse my Reducing Goodhart sequence
Having just read this sequence (or, for some of the posts in it, reread them), I endorse it too: it’s excellent.
It covers a lot of the same ground as my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom) in a more leisurely and discursive way, and I think it ends up in about the same place. Value Learning isn’t about locating the One True Unified Utility Function that is the True Name of happiness and can thus be safely strongly optimized. It’s about treating research into human values like research in any other STEM-like soft-science field: doing the same sorts of cautious, Bayesian, experimental things that we do in any scientific/technical/engineering effort, and avoiding Goodharting by being cautious enough not to trust models (of human values, or anything else) outside their experimentally supported range of validity, like any sensible STE practitioner. So use all of STEM; don’t think only like a mathematician.
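To make the "range of validity" point concrete, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: `observed_outcome` is only a stand-in for what further experiments would reveal (not a claim that a True Value function exists), the straight-line fit stands in for a learned model of values, and the numbers are arbitrary. It shows a naive optimizer trusting the model everywhere, next to a cautious one that only considers options inside the range the experiments actually covered.

```python
# Toy illustration only: all functions and numbers here are hypothetical.

def observed_outcome(x):
    """Stand-in for what running the experiment at x would actually show."""
    return x - 0.1 * x ** 2  # improves at first, then gets worse past x = 5


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Experiments were only ever run for x between 0 and 2.
tested_xs = [0.0, 0.5, 1.0, 1.5, 2.0]
tested_ys = [observed_outcome(x) for x in tested_xs]
slope, intercept = fit_line(tested_xs, tested_ys)


def model(x):
    """The learned proxy: accurate near the tested points, unchecked elsewhere."""
    return slope * x + intercept


# Candidate actions range far beyond anything that was tested.
candidates = [i * 0.5 for i in range(41)]  # 0.0, 0.5, ..., 20.0

# Naive: maximize the model over every candidate, validated or not.
naive_choice = max(candidates, key=model)

# Cautious: only consider candidates inside the experimentally covered range.
lo, hi = min(tested_xs), max(tested_xs)
validated = [x for x in candidates if lo <= x <= hi]
cautious_choice = max(validated, key=model)

print("naive choice:", naive_choice, "outcome:", observed_outcome(naive_choice))
print("cautious choice:", cautious_choice, "outcome:", observed_outcome(cautious_choice))
```

The naive optimizer pushes to the far edge of the candidate set, where the model was never checked and the outcome is bad. The cautious one stops at the edge of the tested range, and the natural next step would be to run more experiments just beyond that edge rather than to extrapolate.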