Matthew Barnett comments on Evaluating the historical value misspecification argument

Matthew Barnett 5 Oct 2023 19:53 UTC
2 points
So this seems to me like it’s the crux. I agree with you that GPT-4 is “pretty good”, but I think the standard necessary for things to go well is substantially higher than “pretty good”.
That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that’s “about as good as human judgement” in the near future. Do you doubt that? If you or anyone else at MIRI doubts that, then I’d be interested in making this prediction more precise, and potentially offering to bet MIRI people on this claim.
requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people
If MIRI people think that the problem here is that our AIs need to be more moral than even humans, then I don’t see where MIRI people think the danger comes from on this particular issue, especially when it comes to avoiding human extinction. Some questions:
- Why did Eliezer and Nate talk about stories like Mickey Mouse commanding a magical broom to fill a cauldron, and then failing because of misspecification, if the problem was actually more about getting the magical broom to exhibit superhuman moral judgement?
- Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
- Eliezer has said on multiple separate occasions that he’d prefer that we try human intelligence enhancement or try uploading alignment researchers onto computers before creating de novo AGI. But uploaded and enhanced humans aren’t going to have superhuman moral judgement. How does this strategy interact with the claim that we need far better-than-human moral judgement to avoid a catastrophe?
CEV was about this; talk about philosophical competence or metaphilosophy was about this.
I mostly saw CEV as an aspirational goal. It seems more like a grand prize that we could best hope for if we solved every aspect of the alignment problem, rather than a minimal bar that Eliezer was setting for avoiding human extinction.
ETA: in Eliezer’s AGI ruin post, he says,
When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone. When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get.
- Rob Bensinger 5 Oct 2023 20:53 UTC
  LW: 10 AF: 8
  That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that’s “about as good as human judgement” in the near future.
  We already have humans who are smart enough to do par-human moral reasoning. For “AI can do par-human moral reasoning” to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?
  - Matthew Barnett 5 Oct 2023 21:56 UTC
    LW: 7 AF: 2
    I don’t think the critical point of contention here is about whether par-human moral reasoning will help with alignment. It could, but I’m not making that argument. I’m primarily making the argument that specifying the human value function, or getting an AI to reflect back (and not merely passively understand) the human value function, seems easier than many past comments from MIRI people suggest. This problem is one aspect of the alignment problem, although by no means all of it, and I think it’s important to point out that we seem to be approaching an adequate solution.
- Vaniver 5 Oct 2023 20:20 UTC
  LW: 6 AF: 5
  Are MIRI people claiming that if, say, a very moral and intelligent human became godlike while preserving their moral faculties, that they would destroy the world despite, or perhaps because of, their best intentions?
  For me, the answer here is “probably yes”; I think there is some bar of ‘moral’ and ‘intelligent’ where this doesn’t happen, but I don’t feel confident about where it is.
  I think there are two things that I expect to be big issues, and probably more I’m not thinking of:
  - Managing freedom for others while not allowing for catastrophic risks; I think lots of ways to mismanage that balance result in ‘destroying the world’, probably with different levels of moral loss.
  - The relevant morality is different for different social roles—someone being a good neighbor does not make them a good judge or good general. Even if someone scores highly on a ‘general factor of morality’ (assuming that such a thing exists) it is not obvious they will make for a good god-emperor. There is relatively little grounded human thought on how to be a good god-emperor. [Another way to put this is that “preserving their moral faculties” is not obviously enough / a good standard; probably their moral faculties should develop a lot in contact with their new situation!]
  But uploaded and enhanced humans aren’t going to have superhuman moral judgement. How does this strategy interact with the claim that we need far better-than-human moral judgement to avoid a catastrophe?
  I understand Eliezer’s position to be that 1) intelligence helps with moral judgment and 2) it’s better to start with biological humans than whatever AI design is best at your intelligence-related subtask, but also that intelligence amplification is dicey business and this is more like “the least bad option” than one that seems actively good.
  Like we have some experience inculcating moral values in humans that will probably generalize better to augmented humans than it will to AIs; but also I think Eliezer is more optimistic (for timing reasons) about amplifications that can be done to adult humans.
  ETA: in Eliezer’s AGI ruin post, he says,
  Yeah, my interpretation of that is “if your target is the human level of wisdom, it will destroy humans just like humans are on track to do.” If someone is thinking “will this be as good as the Democrats being in charge or the Republicans being in charge?” they are not grappling with the difficulty of successfully wielding futuristically massive amounts of power.