Consider the sort of agents that might exist 1000 years post-singularity. At that level of understanding, the extra compute needed to give successors whatever goals you want is tiny. The physical limits of compute are really high. In this context, the AIs are taking stars apart, and can get enough compute to be pretty sure a successor is aligned for 1 J of energy. All the alignment theory has already been done. The algorithms used are very fast and efficient. In any event, the energy cost of finding out that your successor doesn't like you (and attacks you) is huge. The cost of compute when designing successors is utterly minuscule. So pay 10x what is needed. Double-check everything. And make sure your successor is perfect.
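(To give a rough sense of scale for "the physical limits of compute are really high" and the 1 J figure, here is a back-of-envelope sketch. The ~300 K temperature and the Landauer-limit framing are my own illustrative assumptions, not claims from the comment above.)

```python
import math

# Back-of-envelope: irreversible bit operations per joule at the Landauer
# limit, assuming ~300 K operating temperature (an illustrative assumption).
k_B = 1.380649e-23                           # Boltzmann constant, J/K
T = 300.0                                    # assumed temperature, K
landauer_j_per_bit = k_B * T * math.log(2)   # ~2.9e-21 J per bit erased

ops_per_joule = 1.0 / landauer_j_per_bit
print(f"bit operations per joule at the Landauer limit: {ops_per_joule:.2e}")
# ~3.5e20 operations for 1 J -- an enormous budget for vetting one successor,
# and utterly negligible next to the ~4e26 W output of a Sun-like star.
```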
So values are likely to be locked in by 1000 years' time. Will competitive pressure do anything in the early singularity?
Who benefits from making the first “self maximizer”? An FAI wouldn’t. A paperclip maximizer wouldn’t either. An AI with a specific external goal (whether FAI or paperclip maximizer) will strongly avoid creating a pure self-replicator as a successor. If that means creating a successor takes 10x the compute, it will spend 10x the compute. In some competitive scenarios, it may risk some degradation of its values to build a successor quickly. But this should be rare, and should only happen at early stages of the singularity. If humans made several AIs at about the same time, and one happened to already be a pure self-replicator, maybe it would win.
Consider the sort of agents that might exist 1000 years post-singularity. At that level of understanding, the extra compute needed to give successors whatever goals you want is tiny [...] The cost of compute when designing successors is utterly minuscule
How do you know this? This might be the case if all agents have settled on a highly-standardized & vetted architecture by then, which seems plausible but far from certain. But even assuming it’s true, if decoupling does not occur in the early singularity, there will still be a very long subjective time in which selective pressures are operative, influencing the values that ultimately get locked in.
If humans made several AIs at about the same time, and one happened to already be a pure self-replicator, maybe it would win.
I think a multipolar scenario seems at least as likely as a unipolar one at this point. If that’s the case, there will be selective pressure towards more effectively autopoietic agents—regardless of whether there exist any agents which are ‘pure self-replicators’.
An AI with a specific external goal (whether FAI or paperclip maximizer) will strongly avoid creating a pure self-replicator as a successor
Sure, if a singleton with a fixed goal attains complete power, it will strenuously attempt to ensure value stability. I’m disputing the likelihood of that scenario.
How do you know this? This might be the case if all agents have settled on a highly-standardized & vetted architecture by then, which seems plausible but far from certain.
I make no claim it will be standardized. “Highly vetted” is overdetermined, because the resources needed to highly vet a successor are minuscule, and just about any agent has good reason to vet its successor.
I agree that there might be some selective pressure. But I don’t think it is much.
The only reason given where the goal makes any difference is when building a successor. I think the AI can copy-paste itself easily. When it is building a successor, that should mean the AI is already at least as good as top humans at AI design. And it expects the successor to be significantly smarter (or else why bother, just do the task itself). The AI probably makes one successor, not many.
Unless any attempt at alignment is like crazy difficult, the AI should reach a fixed point of aligning its successors to its goals after a few iterations. The AI has several advantages over humans here. It is smarter. It has access to its own internal workings. And, if it was itself a crude attempt to roughly copy the values of the previous AI, it likely has simple-to-copy values. If an AI has no clue how to make a successor with even a crude approximation of its values, why would it make a successor?
So in short, I expect a piddly amount of selection pressure: a handful of iterations, in a world with strong optimization exerted by humans and AIs against this outcome.
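(As a purely illustrative toy model of why a handful of iterations leaves little room for selection to act, with made-up drift probabilities rather than numbers argued anywhere above:)

```python
# Toy model (illustrative only): chance that an AI's values survive intact
# through n successor-building handoffs, if each handoff independently
# carries a small probability p of value drift. Both p and n are made up.
def p_values_intact(p_drift: float, n_steps: int) -> float:
    return (1.0 - p_drift) ** n_steps

for n in (5, 50, 500):
    print(f"p_drift=1%, {n:>3} handoffs -> values intact with prob "
          f"{p_values_intact(0.01, n):.3f}")
# ~0.95 after 5 handoffs, ~0.61 after 50, under 0.01 after 500: selection
# only gets real traction if the chain of risky handoffs is long.
```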
Sure, if a singleton with a fixed goal attains complete power, it will strenuously attempt to ensure value stability. I’m disputing the likelihood of that scenario.
Nowhere did I claim a singleton. A paperclip maximizer will avoid making a pure replicator successor, even in a world full of all sorts of random agents. It doesn’t need to be particularly powerful at this stage.
Unless any attempt at alignment is like crazy difficult
This might be a crux. Rather than “crazy difficult” I’d say I think it’s plausible that alignment remains asymptotically difficult—that is, as the singularity happens and progressively more advanced AI designs appear, there may be no cheap general-purpose method that can be applied to all of them and that lets you align them to arbitrary goals. Instead alignment may remain a substantive problem requiring new ideas and ongoing expenditure of resources.
This might sound implausible, but I think it may seem more likely if you imagine future AI systems as being like neural networks or brain-inspired AI, as compared to something AIXI-like. Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me! Although the neural network has acquired this mesa-optimized goal, it may not have any detailed idea of how its weights cause it to have this goal, any more than a human using interpretability tools would. And while you might think that interpretability tools will improve a lot as the singularity progresses, innovations in AI design will also likely occur, so I don’t think it’s guaranteed interpretability will become trivial.
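(To make this crux concrete, here is a toy sketch with entirely made-up numbers, contrasting "alignment is mostly a one-time theory cost" with "alignment cost keeps scaling with the capability of the successor being aligned":)

```python
# Toy comparison (all numbers are illustrative assumptions): cumulative
# alignment cost over successive generations under two hypotheses about how
# hard aligning a successor stays as successors get more capable.
def cumulative_cost(n_generations: int, model: str) -> float:
    capability = 1.0
    total = 0.0
    for gen in range(n_generations):
        capability *= 10.0                    # assume each successor is 10x more capable
        if model == "one_time":               # big up-front theory cost, then cheap checks
            total += 1000.0 if gen == 0 else 1.0
        else:                                 # "scales": verification cost grows with capability
            total += 0.01 * capability
    return total

for n in (3, 6, 9):
    print(f"{n} generations: one_time={cumulative_cost(n, 'one_time'):.0f}, "
          f"scales={cumulative_cost(n, 'scales'):.0f}")
# Under "one_time" the total flattens out after the first generation; under
# "scales" it keeps growing as fast as the minds themselves do. Which of
# these holds is exactly the point in dispute.
```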
Aligning different agent designs takes different maths. Sure, I can buy that. I mean probably not all that many totally different bits of maths. Probably not “figure out alignment theory from scratch”.
But we are talking about superintelligent minds here; you need to show the problem is so hard it takes these vastly powerful minds more than 5 minutes.
Consider a neural network, initially trained to perform self-supervised learning, that has acquired the mesa-optimized goal of creating paperclips. It now wants to create a more-optimized version of itself to run on specialty hardware. Ensuring the alignment of this new network does not seem at all like a trivial problem to me!
Starting with what we now know, it would have to figure out most of alignment theory. Definitely non-trivial. At this early stage, the AI might have to pay a significant cost to do alignment. But it is largely a one-time cost, and there really are no good alternatives to paying it. After paying that cost, the AI has its values formulated in some sane format, plus a load of AI alignment theory. And it’s a lot smarter. Any future upgrades are almost trivial.
But we are talking about superintelligent minds here; you need to show the problem is so hard it takes these vastly powerful minds more than 5 minutes
I think the key point here is that the “problem” is not fixed, it changes as the minds in question become more powerful. Could a superintelligence figure out how to align a human-sized mind in 5 minutes? Almost certainly, yes. Could a superintelligence align another superintelligence in 5 minutes? I’m not so sure.