my current best guess is that gradient descent is going to want to make our models deceptive
Can you quantify your credence in this claim?
Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, … optimization pressure against deception?
If you don’t like these or want a more specific operationalization of this question, I’m happy with whatever you think is likely, or with you filling out more details.
I think it really depends on the specific training setup. Some are much more likely than others to lead to deceptive alignment, in my opinion. Here are some numbers off the top of my head, though please don’t take these too seriously:
~90%: if you keep scaling up RL in complex environments ad infinitum, eventually you get deceptive alignment.
~80%: conditional on RL in complex environments being the first path to transformative AI, there will be deceptively aligned RL models.
~70%: if you keep scaling up GPT-style language modeling ad infinitum, eventually you get deceptive alignment.
~60%: there will be an existential catastrophe due to deceptive alignment specifically.
~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).
For the optimization pressure question, I really don’t know, but I think “2x, 4x” seems too low—that corresponds to only 1-2 bits. It would be pretty surprising to me if the absolute separation between the deceptive and non-deceptive models was that small in either direction for almost any training setup.
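To make the bits arithmetic explicit, here is a minimal sketch (just the standard conversion bits = log2(factor); the specific factors are arbitrary examples, not estimates of anything):

```python
import math

# Convert a multiplicative factor of optimization pressure against deception
# into bits of selection: bits = log2(factor). Purely illustrative numbers.
for factor in [2, 4, 16, 1024]:
    bits = math.log2(factor)
    print(f"{factor}x pressure against deception ~ {bits:.0f} bits of selection")

# 2x ~ 1 bit and 4x ~ 2 bits, which is why "2x, 4x" corresponds to only 1-2 bits;
# the separation between deceptive and non-deceptive models could easily be much
# larger than that in either direction.
```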
Thank you for putting numbers on it!
~60%: there will be an existential catastrophe due to deceptive alignment specifically.
Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone? In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century. Or do you mean that, conditional on there being an existential catastrophe due to AI, there is a 60% chance it will be caused by deceptive alignment and a 40% chance by other problems like misuse or outer alignment?
Unconditional. I’m rather more pessimistic than an overall 10% chance. I usually give ~80% chance of existential risk from AI.
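For concreteness, here is the difference between the two readings, as a toy calculation using the ~80% total figure just given (no additional claims intended):

```python
# Two readings of "~60%: existential catastrophe due to deceptive alignment":
p_total_ai_xrisk = 0.80   # ~80% chance of existential risk from AI (stated above)

# Reading 1 (the intended one): unconditional probability.
p_cat_from_deception = 0.60
# This would make deceptive alignment roughly 0.60 / 0.80 = 75% of AI x-risk.

# Reading 2 (not intended): conditional on an AI existential catastrophe occurring.
p_deception_given_cat = 0.60
# That would instead imply an unconditional 0.60 * 0.80 = 48% chance.
print(0.60 / 0.80, 0.60 * 0.80)  # 0.75 0.48
```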
In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century
Amongst the LW crowd I’m relatively optimistic, but I’m not that optimistic. I would give maybe a 20% total risk from misalignment this century. (I’m generally expecting a singularity this century with >75% probability, such that most of the alignment risk we will ever face comes this century.)
The number is lower if you consider “how much alignment risk before AI systems are in the driver’s seat,” which I think is very often the more relevant question, but I’d still put it at 10-20%. At various points in the past my point estimates have ranged from 5% up to 25%.
And then on top of that there are significant other risks from the transition to AI. Maybe a total of more like 40% total existential risk from AI this century? With extinction risk more like half of that, and more uncertain since I’ve thought less about it.
I still find a 60% risk from deceptive alignment quite implausible, but I wanted to clarify that 10% total risk is not in line with my view, and I suspect it is not a typical view on LW or the Alignment Forum.
And then on top of that there are significant other risks from the transition to AI. Maybe a total of more like 40% total existential risk from AI this century? With extinction risk more like half of that, and more uncertain since I’ve thought less about it.
40% total existential risk, and extinction risk half of that? Does that mean the other half is some kind of existential catastrophe / bad values lock-in but where humans do survive?
Fwiw, I would put non-extinction existential risk at ~80% of all existential risk from AI. So maybe my extinction numbers are actually not too different than Paul’s (seems like we’re both ~20% on extinction specifically).
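Spelling out the arithmetic behind that comparison (a rough back-of-the-envelope that just multiplies the headline numbers quoted in this thread, all of which are themselves very approximate):

```python
# Rough sanity check on "we're both ~20% on extinction specifically",
# multiplying each total x-risk estimate by its extinction share.
total_xrisk_mine = 0.80        # ~80% total existential risk from AI (stated above)
extinction_share_mine = 0.20   # non-extinction put at ~80% of that, so extinction is ~20% of it
total_xrisk_paul = 0.40        # "maybe a total of more like 40%"
extinction_share_paul = 0.50   # "extinction risk more like half of that"

print(f"mine: ~{total_xrisk_mine * extinction_share_mine:.0%} on extinction")  # ~16%
print(f"Paul: ~{total_xrisk_paul * extinction_share_paul:.0%} on extinction")  # ~20%
```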
And then there’s me, who until now was so certain that any time people talk about x-risk, they mean it to be synonymous with extinction. It does make me curious, though: what kind of scenarios are you imagining in which misalignment doesn’t kill everyone? Do more people place a higher credence on s-risk than I originally suspected?