Aligning different agent designs takes different maths. Sure, I can buy that, though probably it doesn't take all that many totally different bits of maths, and probably not “figure out alignment theory from scratch”.
But we are talking about superintelligent minds here; you need to show the problem is so hard that it takes these vastly powerful minds more than 5 minutes.
Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me!
Starting with what we now know, it would have to figure out most of alignment theory. That's definitely non-trivial. At this early stage, the AI might have to pay a significant cost to do alignment. But it is largely a one-time cost, and there really are no good alternatives to paying it. After paying that cost, the AI has its values formulated in some sane format and a load of AI alignment theory, and it's a lot smarter. Any future upgrades are almost trivial.
But we are talking about superintelligent minds here; you need to show the problem is so hard that it takes these vastly powerful minds more than 5 minutes.
I think the key point here is that the “problem” is not fixed; it changes as the minds in question become more powerful. Could a superintelligence figure out how to align a human-sized mind in 5 minutes? Almost certainly, yes. Could a superintelligence align another superintelligence in 5 minutes? I’m not so sure.