I think you missed some basic details about what I wrote. I encourage people to compare what Eliezer is saying here to what I actually wrote. You said:
If you think you’ve demonstrated by clever textual close reading that Eliezer-2018 or Eliezer-2008 thought that it would be hard to get a superintelligence to understand humans, you have arrived at a contradiction and need to back up and start over.
I never said that you or any other MIRI person thought it would be “hard to get a superintelligence to understand humans”. Here’s what I actually wrote:
Non-MIRI people sometimes strawman MIRI people as having said that AGI would literally lack an understanding of human values. I don’t endorse this, and I’m not saying this.
[...]
I agree that MIRI people never thought the problem was about getting AI to merely understand human values, and that they have generally maintained there was extra difficulty in getting an AI to care about human values. However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem), for example in this 2016 talk from Yudkowsky.[3] The value identification problem is the problem of “pinpointing valuable outcomes to an advanced agent and distinguishing them from non-valuable outcomes”. In other words, it’s the problem of specifying a function that reflects the “human value function” with high fidelity.
I mostly don’t think that the points you made in your comment respond to what I said. My best guess is that you’re responding to a stock character who represents the people who have given similar arguments to you repeatedly in the past. In light of your personal situation, I’m actually quite sympathetic to you responding this way. I’ve seen my fair share of people misinterpreting you on social media too. It can be frustrating to hear the same bad arguments, often made by people with poor intentions, over and over again, and still engage thoughtfully each time. I just don’t think I’m making the same mistakes as those people. I tried to distinguish myself from them in the post.
I would find it slightly exhausting to reply to all of this comment, given that I think you misrepresented me in a big way right out of the gate, so I’m currently not sure if I want to put in the time to compile a detailed response.
That said, I think some of the things you said in this comment were nice, and helped to clarify your views on this subject. I admit that I may have misinterpreted some of the comments you made, and if you provide specific examples, I’m happy to retract or correct them. I’m thankful that you spent the time to engage. :)
Without digging in too much, I’ll say that this exchange and the OP are pretty confusing to me. It sounds like MB is saying “MIRI doesn’t say it’s hard to get an AI that has a value function” and then also saying “GPT has the value function, so MIRI should update”. This seems almost contradictory.
A guess: MB is saying “MIRI doesn’t say the AI won’t have the function somewhere, but does say it’s hard to have an externally usable, explicit human value function”. And then saying “and GPT gives us that”, and therefore MIRI should update.
And EY is blobbing those two things together, and saying neither of them is the really hard part. Even having the externally usable explicit human value function doesn’t mean the AI cares about it. And it’s still a lot of bits, even if you have the bits. So it’s still true that the part about getting the AI to care has to go precisely right.
If there’s a substantive disagreement about the facts here (rather than about the discourse history or whatever), maybe it’s like:
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
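As a purely illustrative aside on the “very small number of bits” point: a minimal sketch of such a pointer might look like the snippet below. Here `ask_llm` is a hypothetical stand-in for a query to a model like GPT-4 (not a real API), and nothing in the sketch touches the question of whether an AI would actually be optimized against this function.

```python
# Illustrative sketch only: a short program that defers "what's good" to an LLM.
# `ask_llm` is a hypothetical placeholder, not a real client or API.

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for querying a large language model."""
    raise NotImplementedError("replace with a real model query")

def value_estimate(outcome_description: str) -> float:
    """Ask the model to score an outcome against (roughly) human values."""
    reply = ask_llm(
        "On a scale from 0 to 10, how good is the following outcome "
        "according to broadly shared human values? Answer with a single number.\n\n"
        + outcome_description
    )
    return float(reply.strip())
```

On this framing, the content of “human values” lives in the model being queried; the few lines above are only a pointer to it, which is exactly why the message is short in bits.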
A guess: MB is saying “MIRI doesn’t say the AI won’t have the function somewhere, but does say it’s hard to have an externally usable, explicit human value function”. And then saying “and GPT gives us that”, and therefore MIRI should update.
[...]
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
I consider this a reasonably accurate summary of this discussion, especially the part I’m playing in it. Thanks for making it more clear to others.
Straw-EY: Complexity of value means you can’t just get the make-AI-care part to happen by chance; it’s a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says “and now call GPT and ask it what’s good”. So now it’s a very small number of bits.
To which I say: “dial a random phone number and ask the person who answers what’s good” can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn’t crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
This is a bad analogy. Phoning a human fails primarily because humans are less smart than the ASI they would be trying to wrangle. By contrast, Yudkowsky has even said that were you to bootstrap human intelligence directly, there is a nontrivial chance that the result is good. This difference is load-bearing!
This does get to the heart of the disagreement, which I’m going to try to badly tap out on my phone.
The old, MIRI-style framing was essentially: we are going to build an AGI out of parts that are not intrinsically grounded in human values but in good abstract reasoning; human values will only be accurately deduced during execution, after the point of construction, so we face the challenge of formally specifying which properties we want preserved without being able to point to those runtime properties at specification time.
The newer, contrasting framing is essentially: we are going to build an AGI out of parts that already have a strong intrinsic, conceptual-level understanding of the values we want them to preserve, and being able to directly point at those values is actually needle-moving toward getting a good outcome. This is hard to do right now, given the poor interpretability and steerability of these systems, but it is nonetheless a relevant component of a potential solution.
It’s more like calling a human who’s as smart as you are, is directly plugged into your brain, and is in fact reusing your world model and train of thought directly to understand the implications of your decision. That’s a huge step up from calling a real human over the phone!
The reason the real-human proposal doesn’t work is that:
the humans you call will lack context on your decision;
they won’t even be able to receive all the context;
they’re dumber and slower than you, so even if you really could write out your entire chain of thoughts and intuitions, consulting them for every decision would be impractical.
Note that none of these considerations apply to integrated language models!
Maybe it’ll be “and now call GPT and ask it what Sam Altman thinks is good” instead.