Unfortunately, this ignores three major issues:
1. race dynamics (also pointed out by Akash)
2. human safety problems: given that alignment is defined "in the narrow sense of making sure AI developers can confidently steer the behavior of the AI systems they deploy", why should we believe that AI developers, and/or the parts of governments that can coerce AI developers, will steer those AI systems in a good direction? For example, that they are benevolent to begin with, and won't be corrupted by power, persuasion, or distributional shift?
3. philosophical errors or bottlenecks: there's a single mention of "wisdom" at the end, but nothing about how to achieve or ensure the unprecedented amount of wisdom, or speed of philosophical progress, that would be needed to navigate something this novel, complex, and momentous. The OP seems to punt such problems to "outside consensus" or "institutions or processes", with apparently no thought given to whether such consensus, institutions, or processes would be up to the task, or to what AI developers could do to help (e.g., by increasing AI philosophical competence).
Like others, I also applaud Sam for writing this, but the actual content makes me more worried: it's evidence that AI developers are not thinking seriously about some major risks and risk factors.
Just riffing on this rather than starting a different comment chain:
If alignment1 is "get AI to follow instructions" (as typically construed, in a "good enough" sort of way) and alignment2 is "get AI to do good things and not bad things" (also in a "good enough" sort of way, but with more assumed philosophical sophistication), then I basically don't care about anyone's safety plan to get alignment1 except insofar as it's part of a plan to get alignment2.
Philosophical errors/bottlenecks can mean you don’t know how to go from 1 to 2. Human safety problems are what stop you from going from 1 to 2 even if you know how, or stop you from trying to find out how.
The checklist has a space for a "nebulous future safety case for alignment1," which is totally fine. I just also want, at the least, a space for a "nebulous future safety case for alignment2" (some earlier items explicitly about progressing towards that safety case would be extra credit). Different people might have different ideas about what form a plan for alignment2 should take (will it focus on the structure of the institution using an aligned1 AI, or on the AI and its training procedure directly?) and about where it should come in the timeline, but I think it should be somewhere.
Part of what makes the corrupting effect of power so insidious is that it seems obvious to humans that everything will work out for the best so long as we have power, so we don't even feel the need to plan for how to get from having control to actually achieving the good outcomes that control was supposed to be an instrumental goal for.