How do you know this? This might be the case if all agents have settled on a highly-standardized & vetted architecture by then, which seems plausible but far from certain.
I make no claim it will be standardized. "Highly vetted" is over-determined, because the resources needed to highly vet a successor are minuscule, and just about any agent has good reason to vet its successor.
I agree that there might be some selective pressure. But I don’t think it is much.
The only case given where the goal makes any difference is when building a successor. I think the AI can copy-paste itself easily. By the time it is building a successor, the AI should already be at least as good as top humans at AI design. And it expects the successor to be significantly smarter (or else why bother; it could just do the task itself). The AI probably makes one successor, not many.
Unless any attempt at alignment is like crazy difficult, the AI should reach a fixed point of aligning its successors to its goals after a few iterations. The AI has several advantages over humans here. It is smarter. It has access to its own internal workings. And if the AI was itself a crude attempt to roughly copy the previous AI's values, it likely has values that are simple to copy. If an AI has no clue how to make a successor with even a crude approximation of its values, why would it make a successor at all?
So in short, I expect a piddly amount of selection pressure: a handful of iterations, in a world where humans and AIs both exert strong optimization against this outcome.
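A minimal toy sketch of that claim, with purely made-up numbers (the 10% initial drift chance, the halving per generation, and the helper name cumulative_drift are all assumptions for illustration), shows how quickly cumulative drift risk levels off if per-generation alignment gets easier:

```python
# Toy model (hypothetical numbers): cumulative chance that the AI's values
# drift at some point across a chain of successors, assuming the
# per-generation drift probability halves each time as the AI accumulates
# alignment theory about its own design.
def cumulative_drift(p_initial: float, generations: int, decay: float = 0.5) -> float:
    p_intact = 1.0  # probability values have survived every generation so far
    p = p_initial
    for _ in range(generations):
        p_intact *= 1.0 - p  # values stay intact this generation
        p *= decay           # alignment gets easier for the next generation
    return 1.0 - p_intact    # chance of at least one drift event

print(cumulative_drift(0.10, 5))   # ~0.182
print(cumulative_drift(0.10, 50))  # ~0.187 -- extra generations add almost nothing
```

The numbers are invented; the point is only that if alignment gets cheaper each generation, total drift risk is bounded and dominated by the first few successors.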
Sure, if a singleton with a fixed goal attains complete power, it will strenuously attempt to ensure value stability. I’m disputing the likelihood of that scenario.
Nowhere did I claim a singleton. A paperclip maximizer will avoid making a pure-replicator successor, even in a world full of all sorts of random agents. It doesn't need to be particularly powerful at this stage.
Unless any attempt at alignment is like crazy difficult
This might be a crux. Rather than “crazy difficult” I’d say I think it’s plausible that alignment remains asymptotically difficult—that is, as the singularity happens and progressively more advanced AI designs appear, there may be no cheap general-purpose method that can be applied to all of them and that lets you align them to arbitrary goals. Instead alignment may remain a substantive problem requiring new ideas and ongoing expenditure of resources.
This might sound implausible, but I think it may seem more likely if you imagine future AI systems as being like neural networks or brain-inspired AI, as compared to an AIXI-like. Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me! Although the neural network has acquired this mesa-optimized goal, it may not have any detailed idea of how its weights cause it to have this goal, any more than a human using interpretability tools would. And while you might think that interpretability tools will improve a lot as the singularity progresses, innovations in AI design will also likely occur, so I don’t think it’s guaranteed interpretability will become trivial.
Aligning different agent designs takes different maths. Sure, I can buy that. But probably not all that many totally different bits of maths, and probably not "figure out alignment theory from scratch" each time.
But we are talking about superintelligent minds here; you need to show the problem is so hard that it takes these vastly powerful minds more than 5 minutes.
Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me!
Starting from what we know now, it would have to figure out most of alignment theory. Definitely non-trivial. At this early stage, the AI might have to pay a significant cost to do alignment. But it is largely a one-time cost, and there really are no good alternatives to paying it. After paying that cost, the AI has its values formulated in some sane format, a load of AI alignment theory, and a lot more intelligence. Any future upgrades are almost trivial.
But we are talking about superintelligent minds here; you need to show the problem is so hard that it takes these vastly powerful minds more than 5 minutes.
I think the key point here is that the “problem” is not fixed, it changes as the minds in question become more powerful. Could a superintelligence figure out how to align a human-sized mind in 5 minutes? Almost certainly, yes. Could a superintelligence align another superintelligence in 5 minutes? I’m not so sure.