This does look to me like a good formalization of the standard argument, and as such it makes it possible to analyze where that argument is weak.
The weak point here seems to be “Harm from AI is proportional to (capabilities) × (misalignment)”, because the argument seems to implicitly assume the usual strong notion of alignment: “Future AI systems will likely be not exactly aligned with human values”.
But, in reality, there are vital aspects of alignment (perhaps we should start calling this partial alignment), such as care about the well-being and freedom of humans, and only misalignment with those aspects would cause harm. Some human values, such as those leading to the widespread practice of factory farming and many others, had better be skipped and not aligned to, because combined with increased capabilities they would lead to disaster.
The Lemma does not apply to partial alignment.
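To make the distinction explicit, here is a minimal sketch in my own hypothetical notation (not taken from the original formalization): let $C$ denote capabilities, $M_{\text{total}}$ misalignment with the full set of human values, and $M_{\text{vital}}$ misalignment with only the vital subset (care about the well-being and freedom of humans). The standard Lemma reads

$$\text{Harm} \;\propto\; C \cdot M_{\text{total}},$$

whereas the version suggested above would be

$$\text{Harm} \;\propto\; C \cdot M_{\text{vital}}, \qquad M_{\text{vital}} \le M_{\text{total}},$$

so a large $M_{\text{total}}$ need not imply large harm as long as $M_{\text{vital}}$ stays small.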
It is true that we don’t know how to safely instill arbitrary values into advanced AI systems (and that might be a good thing, because arbitrary values can be chosen in such a way that they cause plenty of harm).
However, some values might be sufficiently invariant to be natural for some versions of AI systems. For example, it might turn out that care about the “well-being and freedom of all sentient beings” is natural for some AI ecosystems: one can argue that for AI ecosystems which include persistent sentiences within themselves, the “well-being and freedom of all sentient beings” might become a natural value and goal.