I agree with almost all of this, in the sense that if you gave me these claims without telling me where they came from, I’d have actively agreed with the claims.
Things that don’t meet that bar:
General: Lots of these points make claims about what Eliezer is thinking, how his reasoning works, and what evidence it is based on. I don’t necessarily have the same views, primarily because I’ve engaged much less with Eliezer and so don’t have confident Eliezer-models. (They all seem plausible to me, except where I’ve specifically noted disagreements below.)
Agreement 14: I'm not sure exactly what this is saying. If it’s “the AI will probably always be able to seize control of the physical process implementing the reward calculation and have it output the maximum value”, then I agree.
Agreement 16: I agree with the general point, but I would want to know more about the AI system and how it was trained before evaluating whether it would learn world models + action consequences instead of “just being nice”; even with the details, I expect I’d feel pretty uncertain which was more likely.
Agreement 17: It seems totally fine to focus your attention on a specific subset of “easy-alignment” worlds and to ensure that those worlds survive, which could be described as “assuming there’s a hope”. That being said, there’s something in this vicinity I agree with: in trying to solve alignment, people sometimes make totally implausible assumptions about the world; this is a worse strategy for reducing x-risk than working on the worlds you actually expect and giving those worlds another ingredient that, in combination with a “positive model violation”, could save them.
Disagreement 10: I don’t have a confident take on the primate analogy; I haven’t spent enough time looking into it for that.
Disagreement 15: I read Eliezer as saying something different in point 11 of the list of lethalities than what Paul attributes to him here; something more like “if you trained on weak tasks, either (1) your AI system will be too weak to build nanotech or (2) it learned the general core of intelligence and will kill you once you get it to try building nanotech”. I’m not confident in my reading, though.
Disagreement 18: I find myself pretty uncertain about what to expect in the “breed corrigible humans” thought experiment.
Disagreement 22: I was mostly in agreement with this, but “obsoleting human contributions to alignment” is a pretty high bar if you take it literally, and I don’t feel confident that it happens before superintelligent understanding of the world (though it does seem plausible).
On 22, I agree that my claim is incorrect. I think such systems probably won’t obsolete human contributions to alignment while being subhuman in many ways. (I do think their expected contribution to alignment may be large relative to human contributions; but that’s compatible with significant room for humans to add value / to have made contributions that AIs productively build on, since we have different strengths.)
Great, I agree with all of that.