On concrete example 2: I see four bolded claims in ‘fast takeoff is still possible.’ Collectively, to me, in my lexicon and way of thinking about such things, they add up to something very close to ‘alignment is easy.’
The first subsection says human misalignment does not provide evidence for AI misalignment, which (as I understand it?) isn't one of the two mechanisms, and is instead an argument against an alignment difficulty.
The bulk of the second subsection, starting with ‘Let’s consider eight specific alignment techniques,’ looks to me like an explicit argument that alignment is easy, based on your reading of the history of AI capabilities and alignment developments so far?
The third subsection seems to also spend most of its space arguing that its scenario would involve manageable risks (i.e. alignment being easy), although you also argue that evolution/culture still isn’t ‘close enough’ to teach us anything here?
I can totally see how these sections could have been written with the core intention of explaining how distinct-from-evolution mechanisms could cause fast takeoffs. From my perspective as a reader, I think my response and general takeaway that this is mostly an argument for easy alignment is reasonable on reflection, even if that’s not the core purpose it serves in the underlying structure, and it’s perhaps not a fully general argument.
On concrete example 3: I agree that what I said was a generalization of what you said, and you instead said something more specific. And that your later caveats make it clear you are not so confident that things will go smoothly in the future. So yes I read this wrong and I’m sorry about that.
But also I notice I am confused here—if you didn’t mean for the reader to make this generalization, if you don’t think that the failure of current capabilities advances to break current alignment techniques is strong evidence that future capabilities advances won’t break then-optimal alignment techniques, then why are we analyzing all these expected interactions here? Why state the claim that such techniques ‘already generalize’ (which they currently mostly do as far as I know, which is not terribly far) if it isn’t a claim that they will likely generalize in the future?