Second, the line of argument runs like this: most (a supermajority of) possible futures are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (that is, one that does not explicitly share human values) will be bad for humans.
I’m not sure if I’ve ever seen this stated explicitly, but this is essentially a thermodynamic argument. So to me, arguing against “alignment is hard” feels a lot like arguing “But why can’t this one be a perpetual motion machine of the second kind?” And the answer there is, “Ok fine, heat being spontaneously converted to work isn’t literally physically impossible, but the degree to which it is super-exponentially unlikely is greater than our puny human minds can really comprehend, and this is true for almost any set of laws of physics that might exist in any universe that can be said to have laws of physics at all.”
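To make the counting step concrete, here is a toy sketch of why "most arbitrary states are bad" behaves like a thermodynamic claim. Everything in it is an assumption for illustration, not part of the argument as usually stated: a future is modeled as a list of independent features, each constrained feature has exactly one acceptable value out of a few equally likely options, and "arbitrary" means uniformly random. The function name `fraction_of_good_states` and its parameters are hypothetical.

```python
from fractions import Fraction


def fraction_of_good_states(num_constrained_features: int,
                            options_per_feature: int = 2) -> Fraction:
    """Fraction of uniformly random states that satisfy every constraint.

    Toy assumptions (not from the original argument): features are
    independent, and each constrained feature has exactly one acceptable
    value among `options_per_feature` equally likely values.
    """
    return Fraction(1, options_per_feature ** num_constrained_features)


if __name__ == "__main__":
    # The good fraction shrinks exponentially in the number of features
    # that have to come out right.
    for k in (10, 100, 1000):
        print(k, float(fraction_of_good_states(k)))
```

Under these assumptions the good fraction falls off exponentially in the number of things that have to come out right. The thermodynamic case has the same shape, except that the exponent is itself on the order of the number of particles involved, which is roughly where the "super-exponentially unlikely" intuition comes from.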