Just riffing on this rather than starting a different comment chain:
If alignment1 is “get AI to follow instructions” (as typically construed, in a “good enough” sort of way) and alignment2 is “get AI to do good things and not bad things” (also in a “good enough” sort of way, but with more assumed philosophical sophistication), I basically don’t care about anyone’s safety plan to get alignment1 except insofar as it’s part of a plan to get alignment2.
Philosophical errors/bottlenecks can mean you don’t know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.
The checklist has a space for “nebulous future safety case for alignment1,” which is totally fine. I just also want a space for “nebulous future safety case for alignment2” at the very least (some earlier items explicitly about progressing toward that safety case would be extra credit). Different people might have different ideas about what form a plan for alignment2 takes (will it focus on the structure of the institution using an aligned1 AI, or on the AI and its training procedure directly?) and where in the timeline it should come, but I think it should be somewhere.
Part of what makes power’s corrupting influence insidious is that it seems obvious to humans that we can make everything work out for the best so long as we have power: that we don’t even need to plan for how to get from having control to actually achieving the good things control was supposed to be an instrumental goal for.