Important thing to bear in mind here: the relevant point for comparison is not the fantasy-world where the Godzilla-vs-Mega-Godzilla fight happens exactly the way the clever elaborate scheme imagined. The relevant point for comparison is the realistic-world where something went wrong, and the elaborate clever scheme fell apart, and now there’s monsters rampaging around anyway.
Some people, when confronted with a ~~problem~~ Mega-Godzilla, think “I know, I’ll use ~~regular expressions~~ Godzilla.” Now they have two ~~problems~~ rampaging monsters.
Epistemic Status: Low. Very likely wrong but would like to understand why.
It seems easier to intent-align a human-level (or slightly above human-level) AI (HLAI) than a massively smarter-than-human AI.
Some new research options become available to us once we have an aligned HLAI, including:
The HLAI might be able to directly help us do alignment research and solve the general alignment problem.
We could run experiments on the HLAI and get experimental evidence much closer to the domain we are actually trying to solve.
We could use the HLAI to start a training procedure, à la IDA (iterated distillation and amplification).
These schemes seem fragile, because 1) if any of the HLAIs is not aligned, we lose, and 2) if the process of training up to superintelligence fails, whether through some unknown unknown, the HLAI being misaligned, or any of the known failure modes, we lose.
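As a rough intuition pump for why “if any step fails, we lose” makes a scheme fragile, here is a toy sketch in Python. The per-step probability and the step counts are invented purely for illustration, not estimates of anything real:

```python
# Toy illustration of how "if any step fails, we lose" compounds.
# The 0.95 per-step figure and the step counts are invented for illustration.

def overall_success(p_step: float, n_steps: int) -> float:
    """Probability that every one of n independent steps succeeds."""
    return p_step ** n_steps

# e.g. each HLAI instance or training stage independently goes right with p = 0.95
for n in (1, 5, 20):
    print(f"{n:>2} step(s) at 95% each -> overall success {overall_success(0.95, n):.2%}")
# Output:
#  1 step(s) at 95% each -> overall success 95.00%
#  5 step(s) at 95% each -> overall success 77.38%
# 20 step(s) at 95% each -> overall success 35.85%
```

The independence assumption is doing a lot of work here; correlated successes or failures would change the numbers, but the qualitative point about chaining many must-succeed steps stands.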
However, 1) seems like a much easier problem than aligning an AI of arbitrary intelligence. Even though something could well go wrong in aligning an HLAI, it also seems likely that something goes wrong if we try to align an arbitrarily intelligent AI. (This seems related to security mindset… in the best-case world we do just solve the general case of alignment, but that seems hard.)
For 2), it seems like an HLAI would help more than hurt in the process of training up to superintelligence. If the HLAI is actually intent-aligned, this seems like having a fully uploaded alignment researcher, which seems less like getting Godzilla to fight and more like getting a Jaeger to protect Tokyo.
The relevant point isn’t “the realistic world, where the clever scheme fell apart”; the relevant point is “the realistic world, where there is some probability of the clever scheme falling apart, and you need to calculate the expected outcome given that probability, and that expected harm could conceivably go down when you add Godzilla”.
Or to put it another way, even if the worst case is just as bad, the average case could still be better. Analyzing the situation in terms of “what if the clever plan fails” is looking only at the worst case.
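To make the average-case point concrete, here is a minimal sketch in Python comparing expected loss with and without the Godzilla plan. All probabilities and losses are made up for illustration:

```python
# Toy expected-loss comparison. All numbers are made up for illustration;
# they are not estimates of real probabilities.

LOSS_RAMPAGE = 1.0   # Tokyo destroyed: same worst-case loss either way
LOSS_SAFE = 0.0      # Tokyo protected

# Without Godzilla: Mega-Godzilla rampages unchallenged.
p_disaster_without = 1.0

# With Godzilla: some hypothetical chance the clever scheme works; if it
# falls apart, we are back in the same worst case as before.
p_scheme_works = 0.3
p_disaster_with = 1.0 - p_scheme_works

expected_loss_without = p_disaster_without * LOSS_RAMPAGE
expected_loss_with = p_disaster_with * LOSS_RAMPAGE + p_scheme_works * LOSS_SAFE

print(f"Worst case, either way:     loss = {LOSS_RAMPAGE}")
print(f"Expected loss without plan: {expected_loss_without:.2f}")
print(f"Expected loss with plan:    {expected_loss_with:.2f}")
# The worst case is identical, but the expectation is lower with the plan,
# provided the plan does not itself add new ways for Tokyo to get destroyed
# (which is exactly what the Mecha-Godzilla objection below disputes).
```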
In-universe, Mecha-Godzilla had to be built with a Godzilla skeleton, which caused both to turn against humanity.
It feels probable that there will be substantial technical similarities between Production Superintelligences and Alignment Superintelligences, which could cause both of them to turn against us.
(Epistemic state: Low confidence)
Ok, but why isn’t it better to have Godzilla fighting Mega-Godzilla instead of leaving Mega-Godzilla unchallenged?
Because Tokyo still gets destroyed.