In the framing of the grandparent comment, that’s an argument that saving humanity will be an objective for plausible AGIs. The purpose of those disclaimers was to discuss the hypothetical where that’s not the case. The post doesn’t appeal to AGI’s motivations, which makes this hypothetical salient.
For LLM simulacra, I think partial alignment by default is likely. More generally, though, misalignment concerns might prevent AGIs with complicated implicit goals from self-improving too quickly (unless they fail at alignment and create more powerful AGIs misaligned with them, which also seems likely for LLMs). This difficulty might leave them vulnerable to being overtaken by an AGI with absurdly simple, explicitly specified goals: such an AGI would find it much easier to keep itself aligned with itself through self-improvement, and so could undergo recursive self-improvement much more quickly. Those more easily self-improving AGIs probably don’t have humanity in their goals.