I claim that GPT-4 is already pretty good at extracting preferences from human data.
So this seems to me like it’s the crux. I agree with you that GPT-4 is “pretty good”, but I think the standard necessary for things to go well is substantially higher than “pretty good”, and that’s where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment. My guess is Eliezer, Rob, and Nate feel basically the same way.
Basically, I think your later section—”Maybe you think”—is pointing in the right direction, and requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. “Philosophy with a deadline” would be a weird way to put it if you thought contemporary philosophy was good enough.
So this seems to me like it’s the crux. I agree with you that GPT-4 is “pretty good”, but I think the standard necessary for things to go well is substantially higher than “pretty good”, and that’s where the difficulty arises once we start applying higher and higher levels of capability and influence on the environment.
This makes sense to me. On the other hand—it feels like there’s some motte and bailey going on here, if one claim is “if the AIs get really superhumanly capable then we need a much higher standard than pretty good”, but then it’s illustrated using examples like “think of how your AI might not understand what you meant if you asked it to get your mother out of a burning building”.
I don’t understand your objection. A more capable AI might understand that it’s completely sufficient to tell you that your mother is doing fine, and simulate a phone call with her to keep you happy. Or it just talks you into not wanting to confirm in more detail, etc. I’d expect that the problem wouldn’t be getting the AI to do what you want in a specific supervised setting, but remaining in control of the overall situation, which includes being able to rely on the AI’s actions not having any ramifications beyond its narrow task.
The question is how you even train the AI under the current paradigm once “human preferences” stops being a standard for evaluation and just becomes another aspect of the AI’s world model that needs to be navigated.
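To make the “current paradigm” point concrete, here is a minimal sketch (my own illustration, not something from the discussion; all names and data are hypothetical placeholders) of how human preference labels currently act as the standard of evaluation: a reward model is fit to pairwise human judgments with a Bradley-Terry style loss, and a policy is then optimized against that learned reward. The worry in the comment above concerns what changes once the system models the raters themselves rather than simply being graded by them.

```python
# Minimal sketch: in the current paradigm, "human preferences" enter training as
# pairwise labels that a reward model is fit to; the policy is then optimized
# against that learned reward. Everything here is an illustrative placeholder.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a (prompt, response) embedding; stands in for a full LM-based scorer."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: the human-preferred response should score higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy data: embeddings of responses that human raters preferred vs. rejected.
chosen = torch.randn(32, 16)
rejected = torch.randn(32, 16)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
# The learned reward then serves as the "standard for evaluation" a policy is
# trained against; the question above is what replaces this setup once human
# preferences are just one more feature of the world the system is modeling.
```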
Basically, I think your later section—”Maybe you think”—is pointing in the right direction, and requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. “Philosophy with a deadline” would be a weird way to put it if you thought contemporary philosophy was good enough.
I don’t think this is the crux. E.g., I’d wager the number of bits you need to get into an ASI’s goals in order to make it corrigible is quite a bit smaller than the number of bits required to make an ASI behave like a trustworthy human, which in turn is way way smaller than the number of bits required to make an ASI implement CEV.
The issue is that (a) the absolute number of bits for each of these things is still very large, (b) insofar as we’re training for deep competence and efficiency we’re training against corrigibility (which makes it hard to hit both targets at once), and (c) we can’t safely or efficiently provide good training data for a lot of the things we care about (e.g., ‘if you’re a superintelligence operating in a realistic-looking environment, don’t do any of the things that destroy the world’).
None of these points require that we (or the AI) solve novel moral philosophy problems. I’d be satisfied with an AI that corrigibly built scanning tech and efficient computing hardware for whole-brain emulation, then shut itself down; the AI plausibly doesn’t even need to think about any of the world outside of a particular room, much less solve tricky questions of population ethics or whatever.
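As a toy way to read the “bits” framing above (my own illustration, with made-up numbers): if training is thought of as narrowing down a space of candidate goal-systems, the bits required to hit a target grow as the target occupies a smaller fraction of that space, and the ordering claim is just that the corrigible fraction, the trustworthy-human fraction, and the CEV fraction shrink in that order.

```python
import math

def bits_to_hit(target_fraction: float) -> float:
    """Bits needed to single out a target occupying this fraction of goal-space."""
    return math.log2(1.0 / target_fraction)

# Made-up fractions, purely to illustrate the claimed ordering, not real estimates:
for name, frac in [("corrigible", 1e-6), ("trustworthy human", 1e-12), ("CEV", 1e-24)]:
    print(f"{name:>17}: ~{bits_to_hit(frac):.0f} bits")
```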
So this seems to me like it’s the crux. I agree with you that GPT-4 is “pretty good”, but I think the standard necessary for things to go well is substantially higher than “pretty good”
That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that’s “about as good as human judgement” in the near future. Do you doubt that? If you or anyone else at MIRI doubts that, then I’d be interested in making this prediction more precise, and potentially offering to bet MIRI people on this claim.
requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people
If MIRI people think that the problem here is that our AIs need to be more moral than even humans, then I don’t see where MIRI people think the danger comes from on this particular issue, especially when it comes to avoiding human extinction. Some questions:
Why did Eliezer and Nate talk about stories like Mickey Mouse commanding a magical broom to fill a cauldron, and then failing because of misspecification, if the problem was actually more about getting the magical broom to exhibit superhuman moral judgement?
Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
Eliezer has said on multiple separate occasions that he’d prefer that we try human intelligence enhancement or try uploading alignment researchers onto computers before creating de novo AGI. But uploaded and enhanced humans aren’t going to have superhuman moral judgement. How does this strategy interact with the claim that we need far better-than-human moral judgement to avoid a catastrophe?
CEV was about this; talk about philosophical competence or metaphilosophy was about this.
I mostly saw CEV as an aspirational goal. It seems more like a grand prize that we could best hope for if we solved every aspect of the alignment problem, rather than a minimal bar that Eliezer was setting for avoiding human extinction.
ETA: in Eliezer’s AGI ruin post, he says,
When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get.
That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that’s “about as good as human judgement” in the near future.
We already have humans who are smart enough to do par-human moral reasoning. For “AI can do par-human moral reasoning” to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?
I don’t think the critical point of contention here is about whether par-human moral reasoning will help with alignment. It could, but I’m not making that argument. I’m primarily making the argument that specifying the human value function, or getting an AI to reflect back (and not merely passively understand) the human value function, seems easier than many past comments from MIRI people suggest. This problem is one aspect of the alignment problem, although by no means all of it, and I think it’s important to point out that we seem to be approaching an adequate solution.
Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
For me, the answer here is “probably yes”; I think there is some bar of ‘moral’ and ‘intelligent’ where this doesn’t happen, but I don’t feel confident about where it is.
I think there are two things that I expect to be big issues, and probably more I’m not thinking of:
Managing freedom for others while not allowing for catastrophic risks; I think lots of ways to mismanage that balance result in ‘destroying the world’, probably with different levels of moral loss.
The relevant morality is different for different social roles—someone being a good neighbor does not make them a good judge or good general. Even if someone scores highly on a ‘general factor of morality’ (assuming that such a thing exists) it is not obvious they will make for a good god-emperor. There is relatively little grounded human thought on how to be a good god-emperor. [Another way to put this is that “preserving their moral faculties” is not obviously enough / a good standard; probably their moral faculties should develop a lot in contact with their new situation!]
But uploaded and enhanced humans aren’t going to have superhuman moral judgement. How does this strategy interact with the claim that we need far better-than-human moral judgement to avoid a catastrophe?
I understand Eliezer’s position to be that 1) intelligence helps with moral judgment and 2) it’s better to start with biological humans than whatever AI design is best at your intelligence-related subtask, but also that intelligence amplification is dicey business and this is more like “the least bad option” than one that seems actively good.
Like we have some experience inculcating moral values in humans that will probably generalize better to augmented humans than it will to AIs; but also I think Eliezer is more optimistic (for timing reasons) about amplifications that can be done to adult humans.
Yeah, my interpretation of that is “if your target is the human level of wisdom, it will destroy humans just like humans are on track to do.” If someone is thinking “will this be as good as the Democrats being in charge or the Republicans being in charge?” they are not grappling with the difficulty of successfully wielding futuristically massive amounts of power.