For starters, it is probably easier to create a sociopolitical system that bans AGI deployment permanently (or at least until the social system itself fails), than to empower the right process with any sociopolitical system to decide “okay, we will slow down timelines across the globe until we solve alignment, and then we will deploy”.
I don’t think I agree with this, and if I had to guess, it’s because I think the former is harder than you do, not that the latter is easier. But I also don’t think that’s the same thing I’m talking about: I’m thinking of this at the more abstract level of “what terminal goal should we have / is acceptable”, where never deploying an AGI at all is close to being as bad as never solving alignment, rather than about the practical mechanics in those situations. In other words, this is about which of those two outcomes is more preferable, regardless of the ease of getting there. But I can see that a crux here is thinking the former isn’t significantly easier than the latter.
Also deploying AGI before a full solution to alignment can involve s-risk, which is worth considering.
I’m not very worried about this possibility, because it seems like a very small portion of the outcome space—we have to make just enough progress on alignment to make the AIs care about humans being around while not enough to solve the entire problem, which feels like a narrow space.
I feel like there’s a lot more nuance here: superhuman intelligence is not magic and cannot solve all problems. We don’t know the curves for return on intelligence (if you hypothesize an intelligence explosion). And both worlds, one with just humans growing and one with human+AGI systems growing, involve rapid exponential growth where we don’t know for sure when the exponential stops or what stable state they end up in.
I don’t think it would solve all our problems (or at least, I don’t think this is necessarily true), but I think it would take care of nearly everything else we have to worry about. If there are bad stable states after that, that just seems like a natural outcome for us at some point anyway, because if the AGI truly is aligned in the strongest sense, we’d only run into problems that are literally unavoidable given our presence and values. Even in that case, you’d have to argue that that outcome has a non-negligible chance of being worse than what we currently have, which I don’t think you’re doing.
That being said, I can imagine scenarios where humans involving AGIs too early in a value-reflective process can be worse than, say, humans just engaging in moral reflection without an AGI. For instance, I consider utilitarianism a basically incorrect model of human ethics; however, it is possible we hardcode utility functions into an AGI, which may force any reflection we do with the help of the AGI to be restricted in certain ways. I don’t mean to debate the pros or cons of any specific moral philosophy; it’s just that when we’re deeply confused about some aspects of moral philosophy ourselves, it’s difficult to ask an AI to solve that for us without hardcoding certain biases or assumptions into the AI. This problem may be harder than the minimal alignment problem of not killing most humans.
I also think this is a problem outside of moral philosophy: in general, there is the risk of hardcoding metaphysical, epistemic, or technical assumptions into the AI, where we do not even know what assumptions we are smuggling in by doing so. Biological humans might make progress on these questions because we can’t just erase the parts of us that are confused (not without neurosurgery or uploading or something). But we can fail to transmit our confusion to the AI, and the AI might end up confident about something that is incorrect or not what we wanted it to believe.
In general, this is a crux for me. I put fairly significant probability on moral realism being false, but I conceptualize alignment as “how do we reliably make an AI that implements values at all, without deception?”