Unless any attempt at alignment is like crazy difficult
This might be a crux. Rather than “crazy difficult”, I’d say it’s plausible that alignment remains asymptotically difficult: that is, as the singularity happens and progressively more advanced AI designs appear, there may be no cheap general-purpose method that can be applied to all of them and that lets you align them to arbitrary goals. Instead, alignment may remain a substantive problem requiring new ideas and ongoing expenditure of resources.
This might sound implausible, but I think it may seem more likely if you imagine future AI systems as being like neural networks or brain-inspired AI, rather than as AIXI-like agents. Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me! Although the neural network has acquired this mesa-optimized goal, it may not have any detailed idea of how its weights cause it to have this goal, any more than a human using interpretability tools would. And while you might think that interpretability tools will improve a lot as the singularity progresses, innovations in AI design will also likely occur, so I don’t think it’s guaranteed interpretability will become trivial.
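To make “interpretability tools” a bit more concrete, here is a toy sketch of my own (not something from this exchange, and everything in it is made up for illustration): a linear probe trained on a network’s hidden activations to test whether some internal direction correlates with a property we guess is related to its goal. It assumes a PyTorch-style setup.

```python
# Hypothetical sketch of one common interpretability technique: a linear probe.
# The model, data, and "goal" labels are all invented for illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy "trained" network whose internals we want to inspect.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Capture hidden activations with a forward hook.
activations = {}
def save_hidden(module, inputs, output):
    activations["hidden"] = output.detach()
model[1].register_forward_hook(save_hidden)

# Fake inputs, plus labels for a property we *guess* relates to the goal.
x = torch.randn(512, 32)
goal_labels = (x[:, 0] > 0).float()   # stand-in for "is pursuing goal G"
model(x)                              # forward pass populates activations["hidden"]

# Linear probe: can the hidden layer linearly predict the guessed property?
probe = nn.Linear(64, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(activations["hidden"]).squeeze(-1), goal_labels)
    loss.backward()
    opt.step()

with torch.no_grad():
    preds = (probe(activations["hidden"]).squeeze(-1) > 0).float()
    print("probe accuracy:", (preds == goal_labels).float().mean().item())
```

Even if a probe like this reaches high accuracy, it only finds a correlate of a property we guessed at; it doesn’t tell you how the weights actually cause the goal, or that the new, re-architected network will encode it the same way. That gap is what I mean by interpretability not obviously becoming trivial.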
Aligning different agent designs takes different maths. Sure, I can buy that. But probably not all that many totally different bits of maths, and probably not “figure out alignment theory from scratch”.
But we are talking about superintelligent minds here; you need to show the problem is so hard that it takes these vastly powerful minds more than 5 minutes.
Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me!
Starting with what we now know, it would have to figure out most of alignment theory. Definitely non-trivial. At this early stage, the AI might have to pay a significant cost to do alignment. But it is largely a one-time cost, and there really are no good alternatives to paying it. After paying that cost, the AI has its values formulated in some sane format, plus a load of AI alignment theory, and it’s a lot smarter. Any future upgrades are almost trivial.
But we are talking about superintelligent minds here; you need to show the problem is so hard that it takes these vastly powerful minds more than 5 minutes.
I think the key point here is that the “problem” is not fixed; it changes as the minds in question become more powerful. Could a superintelligence figure out how to align a human-sized mind in 5 minutes? Almost certainly, yes. Could a superintelligence align another superintelligence in 5 minutes? I’m not so sure.