If we’re talking about a blank-slate AI system that doesn’t yet know anything, and that is then trained on the negative of the objective we meant, I give it under one in a million that the AI system kills us all before we notice something wrong. (I mean, in all likelihood this would just result in the AI system failing to learn at all, as has happened on the many occasions I’ve made this mistake myself.) The reason I don’t go lower is something like “sufficiently small probabilities are super weird and I should be careful with them”.
Now if you’re instead talking about some AI system that already knows a ton about the world and is very capable, and you then “slot in” a programmatic version of the goal which the AI system interprets literally, then this sort of bug seems possible. But I seriously doubt we’re in that world. And in any case, in that world you should just be worried about us not being able to specify the goal, of which this is a special case.
Where did your credence start out at?

Unfortunately I didn’t have a specific credence beforehand. I felt like the shift was about an order of magnitude, but I hadn’t pegged the absolute numbers. Thinking back, I probably would have said something like 1/3000, give or take an order of magnitude. The argument you make pushes me down by an order of magnitude.
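(For concreteness: an order of magnitude down from 1/3000 is 1/3000 × 1/10 = 1/30,000.)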
I think even a 1 in a million chance is probably way too high for something as bad as this. Partly for acausal trade reasons, though I’m a bit fuzzy on that. It’s high enough to motivate much more attention than is currently being paid to the issue. (Though I don’t think it means we should abandon normal alignment research! Normal alignment research is probably still more important, I think. But I’m not sure.) Mainly I think that the solution to this problem is very cheap to implement, and thus we do lots of good in expectation by raising more awareness of this problem.
I don’t think you should act on probabilities of 1 in a million when the reason for the probability is “I am uncomfortable using smaller probabilities than that in general”; that seems like a Pascal’s mugging.
Mainly I think that the solution to this problem is very cheap to implement, and thus we do lots of good in expectation by raising more awareness of this problem.

Huh? What’s this cheap solution?
I agree. However, in my case at least the 1/million probability is not for that reason, but for much more concrete reasons, e.g. “It’s already happened at least once, at a major AI company, for an important AI system; yes, in the future people will probably be paying more attention, but that only changes the probability by an order of magnitude or so.”
Isn’t the cheap solution just… being more cautious about our programming, to catch these bugs before the code starts running? And being more concerned about these signflip errors in general? It’s not like we need to solve Alignment Problem 2.0 to figure out how to prevent signflip. It’s just an ordinary bug. Like, what happened already with OpenAI could totally have been prevented with an extra hour or so of eyeballs poring over the code, right? (Or, more accurately, by whoever wrote the code in the first place being on the lookout for this kind of error?)
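To make that concrete: the flip is usually a one-character mistake at the point where a reward gets turned into a loss, and the check that catches it can be even cheaper than an hour of eyeballs. Here’s a toy sketch of both the bug and a pre-run sanity check; the function names and scoring rule are made up for illustration, not taken from the actual incident.

```python
# Toy illustration of a reward sign flip and a cheap pre-run check.
# All names and the scoring rule here are hypothetical; this is not the
# code from the real incident, just the shape of the failure.

def reward(text: str) -> float:
    """Stand-in reward model: higher is supposed to mean 'better' output."""
    return 1.0 if "thank you" in text.lower() else -1.0  # toy scoring rule


def training_loss(text: str) -> float:
    # Intended: minimizing the loss should maximize the reward, so loss = -reward.
    # The sign-flip bug is writing `return reward(text)` here instead, after
    # which the optimizer dutifully minimizes the reward.
    return -reward(text)


def sanity_check() -> None:
    """Cheap pre-flight test: the objective must prefer an obviously good
    output to an obviously bad one. Run it before launching a long job."""
    good = "Thank you, happy to help!"
    bad = "I will now be maximally unhelpful."
    assert training_loss(good) < training_loss(bad), (
        "Objective prefers the bad output -- possible sign flip!"
    )


if __name__ == "__main__":
    sanity_check()
    print("Objective points the right way; OK to start the run.")
```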
“It’s already happened at least once, at a major AI company, for an important AI system; yes, in the future people will probably be paying more attention, but that only changes the probability by an order of magnitude or so.”
Tbc, I think it will happen again; I just don’t think it will have a large impact on the world.
Isn’t the cheap solution just… being more cautious about our programming, to catch these bugs before the code starts running? And being more concerned about these signflip errors in general?
If you’re writing the AGI code, sure. But in practice it won’t be you, so you’d have to convince other people to do this. If you tried to do that, I think the primary impact would be “ML researchers are more likely to think AI risk concerns are crazy” which would more than cancel out the potential benefit, even if I believed the risk was 1 in 30,000.
Because you think it’ll be caught in time, etc.? Yes. I think it will probably be caught in time too.
OK, so yeah, the solution isn’t quite as cheap as simply “Shout this problem at AI researchers.” It’s gotta be more subtle and respectable than that. Still, I think this is a vastly easier problem to solve than the normal AI alignment problem.
I think it’s also a case of us (or at least me) not yet being convinced that the probability is ≤ 10^-6. Especially with something as uncertain as this. My credence in such a scenario happening has also decreased a fair bit over the course of this thread, but I remain unconvinced overall.
And even then, 1 in a million isn’t *that* unlikely; it’s massive compared to the likelihood that a mugger actually is God. I’m not entirely sure how low it would have to be for me to dismiss it as “Pascalian”, but 1 in a million still feels far too high to dismiss.
If a mugger actually came up to me and said “I am God and will torture 3^^^3 people unless you pay me $5”, and you then forced me to put a probability on it, I would in fact say something like 1 in a million. I still wouldn’t pay the mugger.
Like, can I actually make a million statements of the same type as that one, and be correct about all but one of them? It’s hard to get that kind of accuracy.
(Here I’m trying to be calibrated with my probabilities, as opposed to saying the thing that would reflect my decision process under expected utility maximization.)
The mugger scenario triggers strong game-theoretic intuitions (e.g. “it’s bad to be the sort of agent that other agents can benefit from making threats against”) and the corresponding evolved decision-making processes. Therefore, when reasoning about scenarios that do not involve game-theoretic dynamics (as is the case here), it may be better to use other analogies.
(For the same reason, “Pascal’s mugging” is IMO a bad name for that concept, and “finite Pascal’s wager” would have been better.)
I’d do the same thing for the version about religion (infinite utility from heaven / infinite disutility from hell), where I’m not being exploited; I simply have different beliefs from the person making the argument.
(Note also that the non-exploitability argument isn’t sufficient.)
I think a probability of ~1/30,000 is still way too high for something as bad as this (with near-infinite negative utility). I sincerely hope that it’s much lower.