Thanks for your detailed and thoughtful response!

> Do you expect that if the training setup described in the post produces a superhuman-level AGI, it does not kill everyone?
Greater than 5% under current uncertainty.
> No behavior we see until a superhuman cognitive structure plays a major role in producing it gives much evidence into what goals that cognitive structure might have.
Are you saying that pre-superhuman behavior doesn’t tell you about its goals? Like, zero mutual information? Doesn’t this prove too much, without relying on more details of the training process? By observing a 5-year-old, you can definitely gather evidence about their adult goals; you just have to interpret it skillfully (which is harder for AIs, of course).
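To make the “not zero mutual information” point concrete, here is a toy sketch in Python. The joint distribution is made up purely for illustration (not a claim about children or AIs); the only point is that a noisy behavioral signal still carries positive mutual information about the later goal.

```python
# Toy illustration: early behavior is a noisy but informative signal of eventual goals,
# so the mutual information between the two is positive, not zero.
# (Hypothetical numbers chosen only to make the point.)
import math

# Joint distribution P(early_behavior, eventual_goal).
joint = {
    ("shares", "altruistic"): 0.35,
    ("shares", "acquisitive"): 0.15,
    ("hoards", "altruistic"): 0.10,
    ("hoards", "acquisitive"): 0.40,
}

def marginal(joint, index):
    """Marginal distribution over one coordinate of the joint."""
    out = {}
    for key, p in joint.items():
        out[key[index]] = out.get(key[index], 0.0) + p
    return out

p_behavior = marginal(joint, 0)
p_goal = marginal(joint, 1)

# I(B; G) = sum over (b, g) of p(b, g) * log2( p(b, g) / (p(b) * p(g)) )
mi = sum(
    p * math.log2(p / (p_behavior[b] * p_goal[g]))
    for (b, g), p in joint.items()
    if p > 0
)
print(f"Mutual information: {mi:.3f} bits")  # ~0.19 bits: noisy evidence, but not zero
```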
> Before there’s a superhuman-level cognitive structure, circuits noticeable in the neural network don’t tell you what goals that cognitive architecture will end up pursuing upon reflection.
I understand this to mean: “If you understand an AI’s motivations before it’s superhuman, that tells you relatively little about its post-reflection values.” I strongly disagree. Isn’t the whole point of the AI improving itself to better achieve the goals it has at the time of self-improvement?
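To gesture at what I mean, here is a minimal sketch with made-up names and numbers (nothing here is from the post or the parent comment): an agent scores candidate self-modifications by its current values, so a modification that would drift its goals scores poorly even if it adds raw capability.

```python
# Minimal sketch: an agent evaluating candidate self-modifications with the goals it
# has *now*, and keeping only the ones that look better by those goals.
# All names and numbers here are hypothetical.

def current_values(outcome: dict) -> float:
    """The agent's present goal: in this toy, it cares only about diamonds produced."""
    return outcome["diamonds"]

# Hypothetical candidate self-modifications and the outcomes the agent predicts for each.
candidate_mods = {
    "keep current cognition":        {"diamonds": 100},
    "faster planner, same goals":    {"diamonds": 140},
    "faster planner, drifted goals": {"diamonds": 20},  # more capable, aimed elsewhere
}

# The agent ranks modifications by its *current* values, so a modification that would
# change what it is optimizing for scores poorly even if it adds capability.
best = max(candidate_mods, key=lambda mod: current_values(candidate_mods[mod]))
print(best)  # "faster planner, same goals"
```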
> A significant part of my point was that concrete stories are needed when you expect the process to succeed, not when you expect it to fail.
I also disagree with this. I think that alignment thinking is plagued by nonspecific, nonconcrete abstract failure modes which may or may not correspond to reasonable chains of events. Often I worry that it’s just abstract reasoning all the way down—that an alignment researcher has never sketched out an actual detailed example of a situation which the abstract words describe.
For example, I think I have very little idea what the sharp left turn is supposed to be. If Nate wrote out a very detailed story, I think I would understand. I might disagree with e.g. how he thinks SGD dynamics work, but I could read the story and say “oh, because Nate thinks that time-bias allows faster circuits to gain more control over cognition, they can ‘betray’ the other motivational circuits and execute an internal coup, and we got here because [the rest of Nate’s story].”
(Importantly, these details would have to be concrete. Not “you train the AI and it stops doing what you want”; that’s not a specific concrete situation.)
But right now, there’s a strong focus on possibly inappropriate analogies with evolution. That doesn’t mean Nate is wrong. It means I don’t know what he’s talking about. I really wish I did, because he is a smart guy, and I’d like to know whether I agree or disagree with his models.
> > If I’m choosing futures on the basis of whether I think they lead to lots of diamonds, why do I need to keep improving that value in order to keep wanting to make diamonds?
> I’m not aware of a way to select features leading to lots of diamonds when these features are present in a superhuman AGI.
I was referring to a situation where the AI is already selecting plans on the basis of whether they lead to diamonds. This is, by assumption, a fact about its motivations. I perceive you to believe that, e.g., the AI needs to keep “improving” its “diamond value” in order to still select plans on the basis of diamonds later in training.
If so—what does this mean? Why would that be true?
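Here is the kind of concrete situation I have in mind, as a toy sketch (hypothetical plans, predictor, and numbers): the “diamond motivation” just is the fact that the plan-selection step ranks plans by predicted diamonds. As long as that step keeps running, plans keep getting chosen on the basis of diamonds.

```python
# Toy sketch of "selecting plans on the basis of whether they lead to diamonds".
# The plans and predictions are hypothetical; the point is that the selection
# criterion itself is the motivational fact, not something requiring ongoing improvement.

def predicted_diamonds(plan: str) -> float:
    """Stand-in for the agent's world-model prediction of diamonds produced by a plan."""
    predictions = {
        "build diamond synthesizer": 1000.0,
        "mine asteroid belt": 250.0,
        "do nothing": 0.0,
    }
    return predictions[plan]

def choose_plan(plans):
    # The agent's "caring about diamonds" is implemented right here: plans are
    # ranked by predicted diamond production and nothing else.
    return max(plans, key=predicted_diamonds)

print(choose_plan(["build diamond synthesizer", "mine asteroid belt", "do nothing"]))
# -> "build diamond synthesizer"
```

The names and numbers are placeholders; the load-bearing part is only that the ranking criterion itself is the motivation.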