A thoughtful decomposition. If we take the time dimension out and consider that AGI just appears ready-to-go, I think I would directionally agree with this.
My key assertion is that we will get sub-AGI capable of causing meaningful harm when deliberately used for this purpose significantly ahead of getting full AGI capable of causing meaningful harm through misalignment. I should unpack that a little more:
Alignment primarily becomes a problem when solutions produced by an AI are difficult for a human to comprehensively verify. Stable Diffusion could be embedding hypnosis-inducing mind viruses that will cause all humans to breed cats in an effort to maximise the cute catness of the universe, but nobody seriously thinks this is taking place, because the model has no representation of any of those concepts nor the capability to embed them.
Causing harm becomes a problem earlier. Stable Diffusion can be used to cause harm, as can AlphaFold. Future models that offer more power will have meaningfully larger envelopes for both harm and good.
Given that we will have the harm problem first, we will have to solve it in order to have a strong chance of facing the alignment problem at all.
If, when we face the alignment problem, we have already solved the harm problem, addressing alignment becomes significantly easier and arguably is now a matter of efficiency rather than existential risk.
It’s not quite as straightforward as this, of course, as it’s possible that whatever techniques we come up with for avoiding deliberate harm by sub-AGIs might be subverted by stronger AGIs. But the primary contention of the essay is that assigning a 15% x-risk to alignment implicitly assumes a solution to the harm problem, yet the harm problem is not currently being invested in at similar or appropriate levels.
In essence, alignment is not unimportant but alignment-first is the wrong order, because to face an alignment x-risk we must first overcome an unstated harm x-risk.
In this formulation, you could argue that the alignment x-risk is 15% conditional on us solving the harm problem. But given that current investment in AI safety is dramatically weighted towards alignment and not harm, the unconditional alignment x-risk is well below 5%, once you account for the additional outcomes: we may never face it because we fail an AI-harm filter; in solving AI-harm we may de-risk alignment; or AI-harm may prove sufficiently difficult that AI research is significantly impacted, slowing or stopping us from reaching the alignment x-risk filter by 2070 (cf. global moratoria on nuclear and biological weapons research, which dramatically slowed progress in those areas). A rough numerical sketch of that decomposition follows.
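To make the bookkeeping concrete, here is a minimal sketch in Python. Only the 15% conditional figure comes from the argument above; the probability that we solve the harm problem and still reach the alignment filter by 2070 is a purely illustrative placeholder, not a number claimed in the essay.

```python
# Minimal sketch of the probability decomposition above.
# The 15% conditional figure is from the discussion; the 30% figure is an
# illustrative assumption (not a claim from the essay) standing in for
# "we neither fail an AI-harm filter, nor de-risk alignment en route,
# nor stall AI research before 2070".

p_alignment_xrisk_given_harm_solved = 0.15  # conditional x-risk discussed above
p_solve_harm_and_reach_filter = 0.30        # assumed, for illustration only

# Unconditional alignment x-risk = conditional risk * probability we ever face it.
p_alignment_xrisk = p_alignment_xrisk_given_harm_solved * p_solve_harm_and_reach_filter
print(f"Unconditional alignment x-risk: {p_alignment_xrisk:.1%}")  # 4.5%, i.e. well below 5%
```

Any value below about one-in-three for the second factor keeps the unconditional figure under 5%, which is all the argument needs.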
I agree that there will be potential for harm as people abuse AIs that aren’t quite superintelligent for nefarious purposes. However, in order for that harm to prevent us from facing existential risk due to the control problem, the harm from nefarious use of sub-superintelligent AI itself has to be x-risk-level, and I don’t really see that being the case.
Consider someone consistently giving each new AI release the instructions “become superintelligent and then destroy humanity”. This is not the control problem, but doing this will surely manifest x-risk behaviour at least somewhat earlier than innocuous instructions would?
I think this failure mode would happen extremely close to ordinary AI risk; I don’t think that e.g. solving this failure mode while keeping everything else the same buys you significantly more time to solve the control problem.
I think you may be underestimating the degree to which these models are like kindling, and that a powerful reinforcement learner could suddenly slurp all of this stuff up and fuck up the world really badly. I personally don’t think a reinforcement learner that is trying to take over the world would be likely to succeed, but the key worry is that we may be able to create a form of life that, like a plague, is not adapted to the limits of its environment, exploits forms of fast growth that can take over very quickly, and then crashes most of life in the process.
Most folks here also assume that such an agent would be able to survive on its own after it killed us, which I think is very unlikely given how many orders of magnitude more competent you have to be to run the entire world. GPT-3 has been able to give me good initial instructions for how to take over the world when pressured to do so (summary: cyberattacks against infrastructure, then threaten people; this is already considered a standard international threat, not something newly invented by GPT-3), but when I then turned around and pressured it to explain why it was a bad idea, it immediately went into detail about how hard it is to run the entire world. Obviously these are all generalizations humans have talked about before, but I still think it’s a solid representation of reality.
That said, because such an agent would in my view likely also be misaligned with itself, I agree with your perspective that humans who are misaligned with each other (i.e., have not successfully deconflicted their agency) are a much greater threat to humanity as a whole.
To the extent that reinforcement learners could damage the world or become a self-replicating plague, they will do so much earlier in the takeoff when given direct, aligned reward for doing so.