I think this is too all-or-nothing about the objectives of the AI system. Following ideas like shard theory, objectives are likely to come in degrees, be numerous and contextually activated, and be messily created by gradient descent.
Because “humans” are probably everywhere in its training data, and because of naive safety efforts like RLHF, I expect AGI to have a lot of complicated pseudo-objectives / shards relating to humans. These objectives may not be good, and even if they are, they probably won’t constitute alignment; but I wouldn’t be surprised if they were enough to make the AGI do something more complicated than simply eliminating us for instrumental reasons.
Of course the AI might undergo a reflection process leading to a coherent utility function when it self-improves, but I expect it to be a fairly complicated one, assigning some sort of valence to humans. We might also have some time before it does that, or be able to guide this values-handshake between shards collaboratively.
In the framing of the grandparent comment, that’s an argument that saving humanity will be an objective for plausible AGIs. The purpose of those disclaimers was to discuss the hypothetical where that’s not the case. The post doesn’t appeal to AGI’s motivations, which makes this hypothetical salient.
For LLM simulacra, I think partial alignment by default is likely. But even more generally, misalignment concerns might prevent AGIs with complicated implicit goals from self-improving too quickly (unless they fail alignment and create more powerful AGIs misaligned with them, which also seems likely for LLMs). This difficulty might make them vulnerable to being overtaken by an AGI with absurdly simple, explicitly specified goals, for which keeping itself aligned (with itself) through self-improvement would be much easier, letting it undergo recursive self-improvement much more quickly. Those more easily self-improving AGIs probably don’t have humanity in their goals.