Thanks for this post! Good insights that refined my arguments.
I’ll present three points:
I fully agree that realizability is needed here. In practice, for the research I’m doing, I’m defining the desired utility as the output of a constructive process, so the correct human preference set is in there by definition. This requires that the set of possible utilities be massive enough that we’re confident we didn’t miss anything. Then, because the process is constructive, it has to be realizable once we’ve defined the “normative assumptions” that map observations to updates of the value functions.
One (partial) rejoinder to “the outside world is much more complicated than any probability distribution which we can explicitly use, since we are ourselves a small part of that world” is that our meta-preferences are conditional: we don’t need to fully take into account all future problems we might encounter; we simply have to define something that, conditional on encountering those problems, will see them as problems (though beware this problem).
Finally, on “A first-pass analysis is that [the weight on u_true] has to be more than 1/2 to guarantee any consideration; any weight less than that, and it’s possible that u_true is as low as it can go in the optimized solution”: this is only correct if u_true is roughly linear in resources. If we assume that u_true suffers steep losses when some key variable is lost (otherwise known as the “value is fragile” assumption), then losses to u_true will likely be limited; see this post. This is the kind of information you need to include if you are to have any chance of defeating Goodhart.
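To make that concrete, here is a toy numerical sketch (the utility shapes are made up purely for illustration): one unit of resource is split between a proxy goal and the key variable u_true cares about, and u_true gets a weight well below 1/2.

```python
import numpy as np

# Toy sketch: x is the fraction of one unit of resource spent on the proxy goal;
# 1 - x is what remains for the key variable that u_true cares about.
xs = np.linspace(0.0, 1.0, 10_001)
w_true = 0.2                       # weight on u_true, well below the 1/2 threshold

u_proxy = xs                       # proxy utility: linear in resources spent on it

# Case A: u_true roughly linear in resources.
u_true_linear = 1.0 - xs
# Case B: "value is fragile": steep losses as the key variable is driven to zero.
u_true_fragile = np.log(0.01 + (1.0 - xs))

for name, u_true in [("linear", u_true_linear), ("fragile", u_true_fragile)]:
    mixture = (1.0 - w_true) * u_proxy + w_true * u_true
    best = int(np.argmax(mixture))
    print(f"{name:8s} optimal x = {xs[best]:.2f}, "
          f"u_true at optimum = {u_true[best]:.2f}, worst possible = {u_true[-1]:.2f}")
```

With the linear u_true, the optimum sits at x = 1 and u_true is driven to its floor; with the steep-loss version, the optimizer stops well short of that at the very same sub-1/2 weight.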
I think we agree that Goodhart can be ameliorated by adding this extra information/uncertainty; it’s not clear whether it can be completely resolved.
My current intuition is that thinking in terms of non-realizable epistemology will give a more robust construction process, even though the constructive way of thinking justifies a kind of realizability assumption. This is partly because it lets us drop the massive-enough set of hypotheses (which one may have to do without in practice anyway), but also because it seems closer to the reality of “humans don’t really have a utility function, not exactly”.
However, I think I haven’t sufficiently internalized your point about utility being defined by a constructive process, so my opinion on that may change as I think about it more.
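For reference, a minimal skeleton of how I currently picture the constructive-process framing (the names and structure below are illustrative assumptions, not the actual construction): a fixed class of candidate utilities, with the “normative assumptions” acting as an update rule from observations to reweightings of those candidates, so the constructed utility is realizable relative to that class by definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable

Utility = Callable[[Hashable], float]      # a candidate value function over outcomes
Observation = Hashable

@dataclass
class ConstructiveProcess:
    candidates: Dict[str, Utility]         # the (hopefully massive enough) hypothesis class
    weights: Dict[str, float]              # current credence over candidates
    # "Normative assumptions": how strongly an observation supports each candidate.
    normative_update: Callable[[Observation, str], float]

    def update(self, obs: Observation) -> None:
        # Map an observation to an update of the value-function weights.
        for name in self.weights:
            self.weights[name] *= self.normative_update(obs, name)
        total = sum(self.weights.values())
        self.weights = {k: v / total for k, v in self.weights.items()}

    def constructed_utility(self, outcome: Hashable) -> float:
        # The desired utility is *defined* as the output of this process, so it
        # lies in the span of the candidate set by construction (realizability).
        return sum(w * self.candidates[name](outcome) for name, w in self.weights.items())
```

The realizability guarantee here is only relative to `candidates`, which is why dropping the massive-enough requirement pushes back toward the non-realizable framing.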
Concerning #3: yeah, I’m currently thinking that you need to make some more assumptions. But I’m not sure I want to make assumptions about resources. I think there may be useful assumptions related to the way the hypotheses are learned: we expect hypotheses with nontrivial weight to agree a great deal, because they are candidate generalizations of the same data, which makes it somewhat hard to entirely dissatisfy some while satisfying others. This doesn’t seem quite helpful enough on its own, but perhaps something in that direction could work.
In any case, I agree that it seems interesting to explore assumptions about the mutual satisfiability of different value functions.
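Here is a quick toy version of that intuition, and of why it doesn’t seem quite helpful enough (the setup is purely illustrative): several value hypotheses fit to the same data agree closely on-distribution, so maximizing one cannot fully dissatisfy the others there, but the guarantee evaporates off-distribution.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Several "value hypotheses" as different generalizations of the same data.
x_train = rng.uniform(0.0, 1.0, size=40)
v_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.05, size=40)
hypotheses = {deg: Polynomial.fit(x_train, v_train, deg) for deg in (2, 4, 6)}

# On the training distribution the hypotheses agree closely...
x_on = np.linspace(0.0, 1.0, 201)
preds_on = np.array([h(x_on) for h in hypotheses.values()])
print("on-distribution disagreement:", float(np.max(preds_on.max(axis=0) - preds_on.min(axis=0))))

# ...so the point maximizing one hypothesis does not leave the others near their minimum.
x_star = x_on[np.argmax(hypotheses[2](x_on))]
print("values at deg-2 optimum:", {deg: round(float(h(x_star)), 3) for deg, h in hypotheses.items()})

# Off-distribution the candidate generalizations come apart, and the mutual
# satisfiability argument no longer constrains much.
x_off = np.linspace(1.0, 2.0, 201)
preds_off = np.array([h(x_off) for h in hypotheses.values()])
print("off-distribution disagreement:", float(np.max(preds_off.max(axis=0) - preds_off.min(axis=0))))
```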
“resources” is more of a shorthand for “the best utility function looks like a smoothmin of a subset of the different features. Given that assumption, the best fuzzy approximation looks like a smoothmin of all the features, with different weights”.
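For concreteness, here is one standard way to write such a weighted smoothmin (the exact functional form is an assumption; the comment doesn’t pin one down), which tracks the worst feature rather than the average:

```python
import numpy as np

def smoothmin(features, weights, temperature=0.1):
    # Weighted soft minimum; approaches the minimum over positively-weighted
    # features as temperature -> 0.
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return -temperature * np.log(np.sum(weights * np.exp(-features / temperature)))

features = np.array([0.9, 0.8, 0.1])   # toy feature scores; the last is nearly lost

# The "best" utility: a smoothmin of a subset of the features (here the 1st and 3rd)...
u_best = smoothmin(features[[0, 2]], weights=[1.0, 1.0])
# ...and the fuzzy approximation: a smoothmin of all the features, with different weights.
u_fuzzy = smoothmin(features, weights=[0.5, 0.2, 0.3])

# Both are pulled toward the worst feature (0.1) rather than the weighted average.
print(round(float(u_best), 2), round(float(u_fuzzy), 2))
```

Under a shape like this, losing any one feature drags the whole utility down with it, which is what the fragile-value point above relies on when it treats losses to u_true as limited.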