Thanks for compiling your thoughts here! There’s a lot to digest, but I’d like to offer a relevant intuition I have specifically about the difficulty of alignment.
Whatever method we use to verify the safety of a particular AI will likely leave its behavior extremely underdetermined. That is, we could verify that the AI is safe for some set of plausible circumstances, but that set of verified situations would be much, much smaller than the set of situations it could encounter “in the wild”.
The AI model, reality, and our values are all high-entropy, while our verification and safety methods are likely to be comparatively low-entropy. So the set of AIs that pass our tests will include members whose properties haven’t been fully constrained.
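To make that concrete with a toy sketch of my own (not part of the original argument; the four-feature “policy” space and the three-case “safety test” are invented for illustration): treat an AI as a Boolean policy over four binary features and the safety test as a check on just three of the sixteen possible inputs, then count how many policies pass.

```python
from itertools import product

# Toy, hypothetical setup: an "AI policy" is any Boolean function over 4 binary
# features, i.e. any of the 2**16 = 65,536 possible truth tables. The "safety
# test" only checks the policy's output on 3 of the 16 possible inputs.
N_FEATURES = 4
ALL_INPUTS = list(product([0, 1], repeat=N_FEATURES))              # 16 situations
TEST_CASES = {(0, 0, 0, 0): 0, (1, 0, 1, 0): 0, (1, 1, 1, 1): 0}   # required "safe" outputs

def passes_test(policy):
    """A policy is a dict mapping each input tuple to an output bit."""
    return all(policy[x] == y for x, y in TEST_CASES.items())

# Enumerate every possible policy and count how many pass the test.
n_total = 0
n_passing = 0
for outputs in product([0, 1], repeat=len(ALL_INPUTS)):
    policy = dict(zip(ALL_INPUTS, outputs))
    n_total += 1
    n_passing += passes_test(policy)

print(f"{n_passing} of {n_total} policies pass the 3-case test")
# -> 8192 of 65536 pass: the 13 untested situations are completely unconstrained.
```

The three checked cases pin down 3 bits of the truth table and leave the other 13 free, which is the low-entropy-test vs. high-entropy-model gap in miniature.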
This isn’t even close to a complete argument, but I’ve found it helpful as an intuition fragment.
I like this intuitive argument.
Now multiply that difficulty by the need to align many more individual AGIs if we see a multipolar scenario, since defending against a misaligned AGI is really difficult.