I appreciate how much detail you’ve used to lay out why you think a lack of human agency is a problem—compared to our earlier conversations, I now have a better sense of what concrete problem you’re trying to solve and why that problem might be important. I can imagine that, e.g., it’s quite difficult to tell how well you’ve fit a curve if the context in which you’re supposed to fit that curve is vulnerable to being changed in ways whose goodness or badness is difficult to specify. I look forward to reading the later posts in this sequence so that I can get a sense of exactly what technical problems are arising and how serious they are.
That said, until I see a specific technical problem that seems really threatening, I’m sticking by my opinion that it’s OK that human preferences vary with human environments, so long as (a) we have a coherent set of preferences for each individual environment, and (b) we have a coherent set of preferences about which environments we would like to be in. Right, like, in the ancestral environment I prefer to eat apples, in the modern environment I prefer to eat Doritos, and in the transhuman environment I prefer to eat simulated wafers that trigger artificial bliss. That’s fine; just make sure to check what environment I’m in before feeding me, and then select the correct food based on my environment. What do you do if you have control over my environment? No big deal, just put me in my preferred environment, which is the transhuman environment.
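To pin down what I mean by (a) and (b), here's a minimal sketch in Python; the environments, foods, and function names are all hypothetical illustrations of my own, not anything proposed in the post:

```python
# Toy model of environment-indexed preferences:
# (a) one coherent ranking over foods per environment, and
# (b) one coherent ranking over the environments themselves.
# All names here are hypothetical illustrations.

food_preferences = {
    "ancestral":  ["apple", "doritos", "bliss_wafer"],
    "modern":     ["doritos", "apple", "bliss_wafer"],
    "transhuman": ["bliss_wafer", "doritos", "apple"],
}

environment_preferences = ["transhuman", "modern", "ancestral"]

def feed(current_environment: str) -> str:
    """Check the environment, then serve its top-ranked food."""
    return food_preferences[current_environment][0]

def choose_environment() -> str:
    """If we control the environment, just pick the most-preferred one."""
    return environment_preferences[0]

print(feed("modern"))              # doritos
print(feed(choose_environment()))  # bliss_wafer
```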
What happens if my preferred environment depends on the environment I'm currently inhabiting, e.g., modern me wants to migrate to the transhuman environment, but ancestral me thinks you're scary and just wants you to go away and leave me alone? Well, that's an inconsistency in my preferences, but it's no more or less problematic than any other inconsistency. If I prefer oranges when I'm holding an apple, but I prefer apples when I'm holding an orange, that's just as annoying as the environment problem. We do need a technique for resolving preferences that are sensitive to initial conditions when those initial conditions appear arbitrary, but we need that technique anyway: it's not some special feature of humans that makes it necessary; any beings with any type of varying preferences would need it in order to have their utility fully optimized.
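Here's the same toy model extended to show the initial-condition sensitivity I have in mind; again, just a sketch with hypothetical names, not a real proposal:

```python
# Extending the toy model: now the preferred environment is itself a
# function of the current environment, so "put me in my preferred
# environment" no longer picks out a unique answer.

preferred_environment = {
    "ancestral": "ancestral",    # ancestral me wants to be left alone
    "modern": "transhuman",      # modern me wants to migrate
    "transhuman": "transhuman",
}

def settle(start: str) -> str:
    """Keep moving to the currently preferred environment until it's a fixed point."""
    current = start
    while preferred_environment[current] != current:
        current = preferred_environment[current]
    return current

print(settle("modern"))     # transhuman
print(settle("ancestral"))  # ancestral: a different fixed point
# The apple/orange case is the analogous mapping with a two-cycle
# instead of fixed points, in which case this loop never terminates.
```

Where you end up depends entirely on where you start, which is exactly the kind of arbitrariness a resolution technique would have to handle.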
It’s certainly worth noting that standard solutions to Goodhart’s law won’t work without modification, because human preferences vary with their environments—but at the moment such modifications seem extremely feasible to me. I don’t understand why your objections are meant to be fatal to the utility of the overall framework of Goodhart’s Law, and I hope you’ll explain that in the next post.
Thanks for the comment :)
I don't agree that we have a coherent set of preferences for each environment.
I'm sure we can agree that we don't have our utility functions written down in FORTRAN on the inside of our skulls. Nor do our brains store a real number associated with each possible state of the universe (and even if they did, by what lights would we call that number a utility function?).
So when we talk about a human's preferences in some environment, we're not talking about opening them up and looking at their brain; we're talking about how humans have this propensity to take reasonable actions that make sense in terms of preferences. Example: You say "would you like Doritos or an apple?" and I say "apple," and then you use this behavior to update your model of my preferences.
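To be concrete about what that update could look like, here's a toy sketch; the noisy-choice (softmax) model and all the names in it are my own illustration, not anything from the post:

```python
import math

# Toy preference inference: watch a choice between two items and do a
# Bayesian update over candidate utility functions, assuming choices
# are noisily rational (softmax in utility).

candidate_utilities = {
    "likes_apples":  {"apple": 1.0, "doritos": 0.0},
    "likes_doritos": {"apple": 0.0, "doritos": 1.0},
}
prior = {"likes_apples": 0.5, "likes_doritos": 0.5}

def likelihood(choice, options, utilities, beta=2.0):
    """P(choice | utilities): softmax over the utilities of the offered options."""
    weights = {o: math.exp(beta * utilities[o]) for o in options}
    return weights[choice] / sum(weights.values())

def update(prior, choice, options):
    """Bayes' rule over the candidate utility functions."""
    unnormalized = {
        name: prior[name] * likelihood(choice, options, utilities)
        for name, utilities in candidate_utilities.items()
    }
    total = sum(unnormalized.values())
    return {name: p / total for name, p in unnormalized.items()}

# You offer "Doritos or an apple?", I say "apple":
print(update(prior, "apple", ["apple", "doritos"]))
# Probability mass shifts toward "likes_apples".
```

The point is just that "my preferences" here are whatever latent object best explains choices like these, not something read directly out of my head.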
But this action-propensity that humans have is sometimes irrational (bold claim, I know) and not so easily modeled as a utility function, even within a single environment.
The scheme you talk about for building up human values seems to have a recursive character to it: you get the bigger, broader human utility function by building it out of smaller, more local human utility functions, and so on, until at some base level of recursion there are utility functions that get directly inferred from facts about the human. But unless there’s some level of human action where we act like rational utility maximizers, this base level already contains the problems I’m talking about, and since it’s the base level those problems can’t be resolved or explained by recourse to a yet-baser level.
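To illustrate the recursive structure as I understand it (this is my own caricature of the scheme, not your actual construction), here's a sketch in which broad utilities are aggregated from narrower ones and the base case is inferred directly from behavior; when the base-level behavior admits no utility representation, there is no deeper level for the failure to be pushed down to:

```python
from itertools import permutations

# A caricature of the recursive construction: broad utilities are
# aggregated from narrower ones, bottoming out in utilities inferred
# directly from observed pairwise choices (winner listed first).

def infer_base_utility(observed_choices):
    """Base case: brute-force search for a strict ranking that reproduces
    every observed choice; return it as a utility, or None if none exists."""
    items = sorted({x for pair in observed_choices for x in pair})
    for ranking in permutations(items):
        rank = {item: i for i, item in enumerate(ranking)}  # lower = better
        if all(rank[winner] < rank[loser] for winner, loser in observed_choices):
            return {item: len(items) - rank[item] for item in items}
    return None  # e.g. cyclic choices: no utility function fits

def build_utility(node):
    """Recursive case: combine children's utilities (here, by summing)."""
    if "observations" in node:  # leaf: infer directly from behavior
        return infer_base_utility(node["observations"])
    child_utilities = [build_utility(child) for child in node["children"]]
    if any(u is None for u in child_utilities):
        return None  # a base-level failure can't be fixed further up
    combined = {}
    for utilities in child_utilities:
        for item, value in utilities.items():
            combined[item] = combined.get(item, 0.0) + value
    return combined

tree = {"children": [
    {"observations": [("apple", "doritos")]},  # representable
    {"observations": [("apple", "orange"), ("orange", "banana"),
                      ("banana", "apple")]},   # cyclic: not representable
]}
print(build_utility(tree))  # None
```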
Different people have different responses to this problem, and I think it’s legitimate to say “well, just get better at inferring utility functions” (though this requires some actual work at specifying a “better”). But I’m going to end up arguing that we should just get better at dealing with models of preferences that aren’t utility functions.