This prediction seems flatly wrong: I wouldn’t bring about an outcome like that. Why do I believe that? Because I have reasonably high-fidelity access to my own policy, via imagining myself in the relevant situations.
It seems like you’re conflating two things here, because the thing you would want is not knowable by introspection. What I think you’re introspecting is that if you noticed that the-thing-you-pursued-so-far was different from what your brother actually wants, you’d do what he actually wants. But the-thing-you-pursued-so-far doesn’t play the role of “your utility function” in the Goodhart argument; all of you plays into that. If the Goodharting were to play out, your detector for differences between the-thing-you-pursued-so-far and what-your-brother-actually-wants would simply fail to warn you that it was happening, because it, too, can only use a proxy measure for the real thing.
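As a rough illustration of the selection effect this argument leans on (a toy sketch with invented numbers, not anything from the exchange itself): when a measure merely correlates with the true target, optimizing hard on the measure also selects for the measure’s errors, so the selected outcome’s true value tends to fall short of what the measure says.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 100_000

# Hypothetical setup: "true_value" stands in for what the brother actually
# wants from an outcome; "proxy" is an imperfect measurement of it.
true_value = rng.normal(0.0, 1.0, size=n_candidates)
proxy = true_value + rng.normal(0.0, 1.0, size=n_candidates)  # noisy proxy

# Weak optimization: take a random outcome.
# Strong optimization: take the outcome that scores highest on the proxy.
random_pick = rng.integers(n_candidates)
best_by_proxy = int(np.argmax(proxy))

print("true value of a random outcome:      ", true_value[random_pick])
print("proxy score of the selected outcome: ", proxy[best_by_proxy])
print("true value of the selected outcome:  ", true_value[best_by_proxy])
# The selected outcome's proxy score overstates its true value; that gap is
# the kind of divergence a detector built on the same proxy would not flag.
```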
I want to know whether, as a matter of falsifiable fact, I would enact good outcomes by my brother’s values were I very powerful and smart. You seem to be sympathetic to the falsifiable-in-principle prediction that, no, I would not. (Is that true?)
Anyways, I don’t really buy this counterargument, but we can consider the following variant (from footnote 2):
We can also swap out “I bring about a good future for my brother” with “my brother brings about a good future for me, and I think that he will do a good job of it, even though he presumably doesn’t contain a ‘perfect’ motivational pointer to my true values.”
“True” values: My own (which I have access to)
“Proxy” values: My brother’s model of my values (I have a model of his model of my values, as part of the package deal by which I have a model of him)
I still predict that he would bring about a good future by my values. Unless you think my predictive model is wrong? I could ask him to introspect on this scenario and get evidence about what he would do?
That prediction may be true. My argument is that “I know this by introspection” (or introspection-and-generalization-to-others) is insufficient. For a concrete example, consider your 5-year-old self. I remember some pretty definite beliefs I had about my future self that turned out wrong, and if I ask myself how aligned I am with my 5-year-old self, I don’t even know how to answer; he just seems way too confused and incoherent.
I think it’s also not absurd that you do have perfect caring in the sense relevant to the argument. This does not require that you make no mistakes currently. If you can correct yourself with increasing intelligence and information, then the pointer is perfect in the relevant sense. “Caring about the values of person X” is relatively simple and may come out of evolution, whereas “those values directly” may not.
My short answer: violations of the IID assumption are the likeliest problem in trying to generalize your values, and I see this as the key flaw underlying the post.
What does that mean? Can you give an example to help me follow?
Specifically, it means that you have to generalize your values to new situations, and without the IID assumption you can’t just interpolate from the values you’ve already expressed: you will likely overfit to your IID data points, and that’s the better case. In other words, your behavior will be dominated by your inductive biases and priors. My fear is that, given real-life intelligence differences that violate the IID assumption, things end up misaligned really fast. I’m not saying that we are doomed, but I want to call this out, since I think breaking IID will most likely cause Turner to do something really bad to his brother if we allow even one order of magnitude more compute.
Scale this up to human civilization, which relies on an IID distribution of intelligence, and I’m much more cautious than Turner is about extrapolating.
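To make the IID worry concrete, here is a minimal toy sketch (the target function, model class, and ranges are all invented for illustration): a flexible model fit only to a narrow, IID-sampled slice of situations tracks the target well there, but far outside that slice its behavior is set by its inductive bias rather than by the target.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_values(x):
    # Stand-in for the "true" values being approximated; made up for illustration.
    return np.sin(x)

# IID training data: situations drawn from a narrow slice of the space.
x_train = rng.uniform(0.0, 3.0, size=40)
y_train = true_values(x_train) + rng.normal(0.0, 0.05, size=40)

# Proxy: a flexible model (degree-6 polynomial) fit only to that slice.
proxy = np.poly1d(np.polyfit(x_train, y_train, deg=6))

x_in = np.linspace(0.0, 3.0, 50)    # situations like the ones trained on
x_out = np.linspace(6.0, 9.0, 50)   # novel situations, e.g. far more capability

print("worst in-distribution error:     ", np.max(np.abs(proxy(x_in) - true_values(x_in))))
print("worst out-of-distribution error: ", np.max(np.abs(proxy(x_out) - true_values(x_out))))
# In-distribution the proxy tracks the target closely; out of distribution its
# outputs are governed by the polynomial's inductive bias, not by true_values.
```

More data from the same narrow slice wouldn’t fix this; the issue is that the new situations come from a different distribution than the one the values were fit on.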