If that sort of thing happens, you would turn off the AI system (as OpenAI did in fact do). The AI system is not going to learn so fast that it prevents you from doing so.
This has lowered my credence in such a catastrophe by about an order of magnitude. However, that’s a fairly small update for something like this. I’m still worried.
Maybe some important AI will learn faster than we expect. Maybe the humans in charge will be grossly negligent. Maybe the architecture and training process won’t involve a period of dumb-misaligned AI prior to smart-misaligned AI. Maybe some unlucky coincidence will prevent the humans from noticing or correcting the problem.
Where did your credence start out at?
If we’re talking about a blank-slate AI system that doesn’t yet know anything, that then is trained on the negative of the objective we meant, I give it under one in a million that the AI system kills us all before we notice something wrong. (I mean, in all likelihood this would just result in the AI system failing to learn at all, as has happened the many times I’ve done this myself.) The reason I don’t go lower is something like “sufficiently small probabilities are super weird and I should be careful with them”.
Now if you’re instead talking about some AI system that already knows a ton about the world and is very capable and now you “slot in” a programmatic version of the goal and the AI system interprets it literally, then this sort of bug seems possible. But I seriously doubt we’re in that world. And in any case, in that world you should just be worried about us not being able to specify the goal, with this as a special case of that circumstance.
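To make the “trained on the negative of the objective we meant” case concrete, here is a toy sketch in Python (my own illustrative example; the function names, the hill-climbing loop, and the `SIGN_BUG` flag are made up, not anything from OpenAI’s actual code):

```python
# Toy illustration of a sign-flip bug: the optimizer faithfully maximizes
# training_reward, but a stray minus sign makes that the negation of the
# objective we actually meant.
import numpy as np

def intended_reward(action: np.ndarray) -> float:
    """The objective we *meant*: highest when the action is near zero."""
    return float(-np.sum(action ** 2))

SIGN_BUG = True  # set to False for the correct implementation

def training_reward(action: np.ndarray) -> float:
    # The bug is one character: with SIGN_BUG, maximizing this value
    # minimizes the intended objective.
    r = intended_reward(action)
    return -r if SIGN_BUG else r

# Crude hill-climbing "training loop", just to show the symptom: with the
# bug, the intended objective gets steadily worse, which is exactly the
# kind of training-curve anomaly a human operator would notice early.
rng = np.random.default_rng(0)
action = rng.normal(size=3)
for step in range(201):
    candidate = action + 0.1 * rng.normal(size=3)
    if training_reward(candidate) > training_reward(action):
        action = candidate
    if step % 50 == 0:
        print(step, round(intended_reward(action), 3))
```

With `SIGN_BUG = True` the printed intended-reward values trend downward from the very first steps; flipping the flag makes them climb toward zero, which is the point above about the error showing up long before the system is capable of anything dangerous.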
Unfortunately I didn’t have a specific credence beforehand. I felt like the shift was about an order of magnitude, but I didn’t peg the absolute numbers. Thinking back, I probably would have said something like 1/3000, give or take an order of magnitude. The argument you make pushes me down by an order of magnitude.
I think even a 1 in a million chance is probably way too high for something as bad as this. Partly for acausal trade reasons, though I’m a bit fuzzy on that. It’s high enough to motivate much more attention than is currently being paid to the issue. (Though I don’t think it means we should abandon normal alignment research! Normal alignment research is probably still more important, I think. But I’m not sure.) Mainly, I think that the solution to this problem is very cheap to implement, and thus we do lots of good in expectation by raising more awareness of this problem.
I don’t think you should act on probabilities of 1 in a million when the reason for the probability is “I am uncomfortable using smaller probabilities than that in general”; that seems like a Pascal’s mugging.
Huh? What’s this cheap solution?
I agree. However, in my case at least the 1/million probability is not for that reason, but for much more concrete reasons, e.g. “It’s already happened at least once, at a major AI company, for an important AI system; yes, in the future people will probably be paying more attention, but that only changes the probability by an order of magnitude or so.”
Isn’t the cheap solution just… being more cautious about our programming, to catch these bugs before the code starts running? And being more concerned about these signflip errors in general? It’s not like we need to solve Alignment Problem 2.0 to figure out how to prevent signflip. It’s just an ordinary bug. Like, what happened already with OpenAI could totally have been prevented with an extra hour or so of eyeballs poring over the code, right? (Or, more accurately, by whoever wrote the code in the first place being on the lookout for this kind of error?)
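A sketch of the kind of cheap, ordinary check being gestured at here (the `reward` function and the example strings are hypothetical placeholders, not anyone’s real system): a pre-training sanity test that fails loudly if a sign flip sneaks in anywhere between the reward and the optimizer.

```python
# Ordinary-software-engineering guard against a flipped reward sign: assert
# that a clearly-better example scores higher than a clearly-worse one
# before any training compute is spent.

def reward(text: str) -> float:
    """Stand-in for a learned or hand-written reward model (higher = better)."""
    return -text.count("!")  # toy proxy: penalize shouting

def test_reward_sign_is_not_flipped():
    clearly_better = "Here is a calm, helpful answer."
    clearly_worse = "!!!!!!!!!!"
    # For this to be worth anything, it should exercise the same code path
    # the optimizer actually sees (reward, loss, advantage), not a copy.
    assert reward(clearly_better) > reward(clearly_worse)

if __name__ == "__main__":
    test_reward_sign_is_not_flipped()
    print("reward-sign sanity check passed")
```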
Tbc, I think it will happen again; I just don’t think it will have a large impact on the world.
If you’re writing the AGI code, sure. But in practice it won’t be you, so you’d have to convince other people to do this. If you tried to do that, I think the primary impact would be “ML researchers are more likely to think AI risk concerns are crazy” which would more than cancel out the potential benefit, even if I believed the risk was 1 in 30,000.
Because you think it’ll be caught in time, etc. Yes, I think it will probably be caught in time too.
OK, so yeah, the solution isn’t quite as cheap as simply “Shout this problem at AI researchers.” It’s gotta be more subtle and respectable than that. Still, I think this is a vastly easier problem to solve than the normal AI alignment problem.
I think it’s also a case of us (or at least me) not yet being convinced that the probability is ≤ 10^-6, especially with something as uncertain as this. My credence in such a scenario has also decreased a fair bit over the course of this thread, but I remain unconvinced overall.
And even then, 1 in a million isn’t *that* unlikely—it’s massive compared to the likelihood that a mugger is actually a God. I’m not entirely sure how low it would have to be for me to dismiss it as “Pascalian”, but 1 in a million still feels far too high.
If a mugger actually came up to me and said “I am God and will torture 3^^^3 people unless you pay me $5”, and you then forced me to put a probability on it, I would in fact say something like 1 in a million. I still wouldn’t pay the mugger.
Like, can I actually make a million statements of the same type as that one, and be correct about all but one of them? It’s hard to get that kind of accuracy.
(Here I’m trying to be calibrated with my probabilities, as opposed to saying the thing that would reflect my decision process under expected utility maximization.)
The mugger scenario triggers strong game-theoretic intuitions (e.g. “it’s bad to be the sort of agent that other agents can benefit from making threats against”) and the corresponding evolved decision-making processes. Therefore, when reasoning about scenarios that do not involve game-theoretic dynamics (as is the case here), it may be better to use other analogies.
(For the same reason, “Pascal’s mugging” is IMO a bad name for that concept, and “finite Pascal’s wager” would have been better.)
I’d do the same thing for the version about religion (infinite utility from heaven / infinite disutility from hell), where I’m not being exploited, I simply have different beliefs from the person making the argument.
(Note also that the non-exploitability argument isn’t sufficient.)
I think a probability of ~1/30,000 is still way too high for something as bad as this (with near-infinite negative utility). I sincerely hope that it’s much lower.
All of these worry me as well. It simply doesn’t console me enough to think that we “will probably notice it”.
Surely with a sufficiently hard takeoff it would be possible for the AI to prevent itself from being turned off? If not, couldn’t the AI just deceive its creators into thinking that no signflip has occurred (e.g. making it look like it’s gaining utility from doing something beneficial to human values when it’s actually losing it)? How would we be able to determine that it’s happened before it’s too late?
Further to that, what if this fuck-up happens during an arms race when its creators haven’t put enough time into safety to prevent this type of thing from happening?
In this specific example, the error becomes clear very early on in the training process. The standard control problem issues with advanced AI systems don’t apply in that situation.
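As a sketch of why the error surfaces early (the window size and threshold are made-up illustrative values, and `SignFlipMonitor` is a hypothetical name, not a real library): even a trivial monitor on the intended objective flags a sign flip within the first stretch of training.

```python
# Trivial training-time monitor: halt if the objective we *meant* to
# maximize has been trending downward over a recent window.
from collections import deque

class SignFlipMonitor:
    def __init__(self, window: int = 100, tolerance: float = 0.0):
        self.history = deque(maxlen=window)
        self.tolerance = tolerance

    def update(self, intended_objective_value: float) -> bool:
        """Record one measurement; return True if training should be halted."""
        self.history.append(intended_objective_value)
        if len(self.history) < self.history.maxlen:
            return False  # not enough data yet
        values = list(self.history)
        half = len(values) // 2
        early = sum(values[:half]) / half
        late = sum(values[half:]) / (len(values) - half)
        # A sign-flipped objective drives this steadily down, so the late
        # average falls below the early average almost immediately.
        return late < early - self.tolerance

# Usage sketch inside a training loop:
#   if monitor.update(measured_intended_objective):
#       stop_training_and_page_a_human()
```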
As for the arms race example, building an AI system of that sophistication to fight in your conflict is like building a Dyson Sphere to power your refrigerator. Friendly AI isn’t the sort of thing major factions are going to want to fight with each other over. If there’s an arms race, either something delightfully improbable and horrible has happened, or it’s an extremely lopsided “race” between a Friendly AI faction and a bunch of terrorist groups.
EDIT (From two months in the future...): I am not implying that such a race would be an automatic win, or even a likely win, for said hypothesized Friendly AI faction. For various reasons, this is most certainly not the case. I’m merely saying that the Friendly AI faction will have vastly more resources than all of its competitors combined, and all of its competitors will be enemies of the world at large, etc.
Addressing this whole situation would require actual nuance. This two-month-old throwaway comment is not the place to put that nuance. And besides, it’s been done before.
Can we be sure that we’d pick it up during the training process, though? And would it be possible for it to happen after the training process?
That’s a bold assumption to make...
Sorry for the dumb question a month after the post, but I’ve just found out about deceptive alignment. Do you think it’s plausible that a signflipped AGI could fake being an FAI in the training stage, just to take a treacherous turn at deployment?
Not really, because it takes time to train the cognitive skills necessary for deception.
You might expect this if your AGI was built with a “capabilities module” and a “goal module” and the capabilities were already present before putting in the goal, but it doesn’t seem like AGI is likely to be built this way.
Would that not be the case with *any* form of deceptive alignment, though? Surely it (deceptive alignment) wouldn’t pose a risk at all if that were the case? Sorry in advance for my stupidity.