Just riffing on this rather than starting a different comment chain:
If alignment1 is “get AI to follow instructions” (as typically construed, in a “good enough” sort of way) and alignment2 is “get AI to do good things and not bad things” (also in a “good enough” sort of way, but with more assumed philosophical sophistication), I basically don’t care about anyone’s safety plan to get alignment1 except insofar as it’s part of a plan to get alignment2.
Philosophical errors/bottlenecks can mean you don’t know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.
The checklist has a space for “nebulous future safety case for alignment1,” which is totally fine. I just also want a space for “nebulous future safety case for alignment2” at the very least (some earlier items explicitly about progressing toward that safety case would be extra credit). Different people might have different ideas about what form a plan for alignment2 takes (will it focus on the structure of the institution using an aligned1 AI, or on the AI and its training procedure directly?) and where in the timeline it should come, but I think it should be somewhere.
Part of what makes power’s corrupting influence insidious is that it seems obvious to humans that we can make everything work out for the best so long as we have power: that we don’t even need to plan for how to get from having control to actually achieving the good things control was supposed to be an instrumental goal for.