If a misaligned AI had 1/trillion “protecting the preferences of whatever weak agents happen to exist in the world”, why couldn’t it also have 1/trillion other vaguely human-like preferences, such as “enjoy watching the suffering of one’s enemies” or “enjoy exercising arbitrary power over others”?
From a purely selfish perspective, I think I might prefer that a misaligned AI kills everyone, and take my chances with continuations of myself (my copies/simulations) elsewhere in the multiverse, rather than face whatever the sum-of-desires of the misaligned AI decides to do with humanity. (With the usual caveat that I’m very philosophically confused about how to think about all of this.)
As I said:
> I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.
I think it’s totally plausible for the AI to care about what happens with humans in a way that conflicts with our own preferences. I just don’t believe it’s because AI doesn’t care at all one way or the other (such that you should make predictions based on instrumental reasoning like “the AI will kill humans because it’s the easiest way to avoid future conflict” or other relatively small considerations).
I’m worried that people, after reading your top-level comment, will become insufficiently worried about misaligned AI (from their selfish perspective), because it seems like you’re suggesting (conditional on misaligned AI) 50% chance of death and 50% alive and well for a long time (due to 1/trillion kindness), which might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age.
I feel like “misaligned AI kills everyone because it doesn’t care at all” can be a reasonable lie-to-children (for many audiences) since it implies a reasonable amount of concern about misaligned AI (from both selfish and utilitarian perspectives) while the actual all-things-considered case for how much to worry (including things like simulations, acausal trade, anthropics, bigger/infinite universes, quantum/modal immortality, s-risks, 1/trillion values) is just way too complicated and confusing to convey to most people. Do you perhaps disagree and think this simplified message is too alarming?
My objection is that the simplified message is wrong, not that it’s too alarming. I think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone,” while being a much more reasonable best guess. I think being wrong is bad for a variety of reasons. It’s unclear if you should ever be in the business of telling lies-told-to-children to adults, but you certainly shouldn’t be doubling down on them in the course of an argument.
I don’t think misaligned AI drives the majority of s-risk (I’m not even sure that s-risk is higher conditioned on misaligned AI), so I’m not convinced that it’s a super relevant communication consideration here. The future can be scary in plenty of ways other than misaligned AI, and it’s worth discussing those as part of “how excited should we be for faster technological change.”
I regret mentioning “lie-to-children” as it seems a distraction from my main point. (I was trying to introspect/explain why I didn’t feel as motivated to express disagreement with the OP as you, not intending to advocate or endorse anyone going into “the business of telling lies-told-to-children to adults”.)
My main point is that I think “misaligned AI has a 50% chance of killing everyone” isn’t alarming enough, given what I think happens in the remaining 50% of worlds, versus what a typical person is likely to infer from this statement, especially after seeing your top-level comment where you talk about “kindness” at length. Can you try to engage more with this concern? (Apologies if you already did, and I missed your point instead.)
> I think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone,” while being a much more reasonable best guess.
(Addressing this since it seems like it might be relevant to my main point.) I find it very puzzling that you think “misaligned AI has a 50% chance of killing everyone” is practically as alarming as “misaligned AI has a 95% chance of killing everyone”. Intuitively it seems obvious that the latter should be almost twice as alarming as the former. (I tried to find reasons why this intuition might be wrong, but couldn’t.) The difference also seems practically relevant (if by “practically as alarming” you mean the difference is not decision/policy relevant). In the grandparent comment I mentioned that the 50% case “might not seem so bad compared to keeping AI development on hold indefinitely which potentially implies a high probability of death from old age” but you didn’t seem to engage with this.
Yeah, I think “no control over the future, 50% you die” is like 70% as alarming as “no control over the future, 90% you die.” Even if it were only 50% as concerning, all of these differences seem tiny in practice compared to other sources of variation in “do people really believe this could happen?” or other inputs into decision-making. I think it’s correct to summarize as “practically as alarming.”
I’m not sure what you want engagement with. I don’t think the much worse outcomes are closely related to unaligned AI so I don’t think they seem super relevant to my comment or Nate’s post. Similarly for lots of other reasons the future could be scary or disorienting. I do explicitly flag the loss of control over the future in that same sentence. I think the 50% chance of death is probably in the right ballpark from the perspective of selfish concern about misalignment.
Note that the 50% probability of death includes the possibility of AI having preferences about humans incompatible with our survival. I think the selection pressure for things like spite is radically weaker for the kinds of AI systems produced by ML than for humans (for simple reasons—where is the upside to the AI from spite during training? seems like if you get stuff like threats it will primarily be instrumental rather than a learned instinct) but didn’t really want to get into that in the post.
> I do explicitly flag the loss of control over the future in that same sentence.
In your initial comment you talked a lot about AI respecting the preferences of weak agents (using 1/trillion of its resources), which implies handing back control of a lot of resources to humans, which from the selfish or scope-insensitive perspective of typical humans probably seems almost as good as not losing that control in the first place.
> I don’t think the much worse outcomes are closely related to unaligned AI so I don’t think they seem super relevant to my comment or Nate’s post.
If people think that (conditional on unaligned AI) in 50% of worlds everyone dies and the other 50% of worlds typically look like small utopias where existing humans get to live out long and happy lives (because of 1/trillion kindness), then they’re naturally going to think that aligned AI can only be better than that. So even if s-risks apply almost equally to both aligned and unaligned AI, I still want people to talk about it when talking about unaligned AIs, or take some other measure to ensure that people aren’t potentially misled like this.
(It could be that I’m just worrying too much here, that empirically people who read your top-level comment won’t get the impression that close to 50% of worlds with unaligned AIs will look like small utopias. If this is what you think, I guess we could try to find out, or just leave the discussion here.)
> where is the upside to the AI from spite during training?
Maybe the AI develops it naturally from multi-agent training (intended to make the AI more competitive in the real world) or the AI developer tried to train some kind of morality (e.g. sense of fairness or justice) into the AI.
I think “50% you die” is more motivating to people than “90% you die” because in the former, people are likely to be able to increase the absolute chance of survival more, because at 90%, extinction is overdetermined.
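One way to make the “overdetermined” point concrete is a toy model in which extinction can come from several independent causes. The sketch below is illustrative only: the independence assumption, the function names, and the choice of two equally likely causes at the 90% level are all assumptions made here, not claims from the comment above.

```python
import math

def total_risk(cause_probs):
    """Probability that at least one independent cause leads to extinction."""
    survive = 1.0
    for p in cause_probs:
        survive *= (1.0 - p)
    return 1.0 - survive

def gain_from_fixing_one(cause_probs, index=0):
    """Absolute drop in extinction risk if one cause is fully eliminated."""
    remaining = [p for i, p in enumerate(cause_probs) if i != index]
    return total_risk(cause_probs) - total_risk(remaining)

# Hypothetical 50% world: a single cause carries all of the risk.
single_cause = [0.5]

# Hypothetical 90% world: two independent, equally likely causes
# chosen so that the combined risk is 0.9.
p_each = 1.0 - math.sqrt(0.1)
two_causes = [p_each, p_each]

gain_at_50 = gain_from_fixing_one(single_cause)
gain_at_90 = gain_from_fixing_one(two_causes)
print(f"risk 0.50 -> fixing one cause gains {gain_at_50:.3f}")
print(f"risk {total_risk(two_causes):.2f} -> fixing one cause gains {gain_at_90:.3f}")
```

In this toy setup, eliminating one cause buys a larger absolute improvement in the 50% world than in the 90% world, because the second cause still does most of the damage on its own.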
I think I tend to base my level of alarm on the log of the severity*probability, not the absolute value. Most of the work is getting enough info to raise a problem to my attention to be worth solving. “Oh no, my house has a decent >30% chance of flooding this week, better do something about it, and I’ll likely enact some preventative measures whether it’s 30% or 80%.” The amount of work I’m going to put into solving it is not twice as much if my odds double; mostly there’s a threshold around whether it’s worth dealing with or not.
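A rough sketch of this log-scale heuristic, assuming alarm is proportional to log10(severity × probability). The severity figure (roughly 8 billion lives) and the function names are placeholders chosen for illustration, not numbers from the thread.

```python
import math

# Hypothetical severity: roughly the current human population.
SEVERITY = 8e9

def linear_alarm(probability, severity=SEVERITY):
    """Alarm proportional to expected loss."""
    return severity * probability

def log_alarm(probability, severity=SEVERITY):
    """Alarm proportional to the log of expected loss."""
    return math.log10(severity * probability)

# Compare a 95% estimate against a 50% estimate under each scale.
linear_ratio = linear_alarm(0.95) / linear_alarm(0.50)
log_ratio = log_alarm(0.95) / log_alarm(0.50)

print(f"linear scale: 95% is {linear_ratio:.2f}x as alarming as 50%")
print(f"log scale:    95% is {log_ratio:.2f}x as alarming as 50%")
```

On the linear scale the 95% estimate is nearly twice as alarming; on the log scale the two are within a few percent of each other, which matches the threshold-like behavior described above.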
Setting that aside, it reads to me like the frame-clash happening here is (loosely) between “50% extinction, 50% not-extinction” and “50% extinction, 50% utopia”, where for the first gamble of course 1:1 odds on extinction is enough to raise it to “we need to solve this damn problem”, but for the second gamble it’s actually much more relevant whether it’s a 1:1 or a 20:1 bet. I’m not sure which one is the relevant one for you two to consider.
> Setting that aside, it reads to me like the frame-clash happening here is (loosely) between “50% extinction, 50% not-extinction” and “50% extinction, 50% utopia”
Yeah, I think this is a factor. Paul talked a lot about “1/trillion kindness” as the reason for non-extinction, but 1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives (even better/longer lives than without AI) so it seemed to me like he was (maybe unintentionally) giving the reader a frame of “50% extinction, 50% small utopia”, while still writing other things under the “50% extinction, 50% not-extinction” frame himself.
> 1/trillion kindness seems to directly imply a small utopia where existing humans get to live out long and happy lives
Not a direct implication, because the AI might have other human-concerning preferences that are larger than 1/trillion. Cf. the top-level comment: “I’m not talking about whether the AI has spite or other strong preferences that are incompatible with human survival, I’m engaging specifically with the claim that AI is likely to care so little one way or the other that it would prefer just use the humans for atoms.”
I’d guess “most humans survive” vs. “most humans die” probabilities don’t correspond super closely to “presence of small pseudo-kindness”. Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
> I’d guess “most humans survive” vs. “most humans die” probabilities don’t correspond super closely to “presence of small pseudo-kindness”. Because of how other preferences could outweigh that, and because cooperation/bargaining is a big reason for why humans might survive aside from intrinsic preferences.
Yeah, I think that:
“AI doesn’t care about humans at all so kills them incidentally” is not most of the reason that AIs may kill humans, and my bottom line 50% probability of AI killing us also includes the other paths (AI caring a bit but failing to coordinate to avoid killing humans, conflict during takeover leading to killing lots of humans, AI having scope-sensitive preferences for which not killing humans is a meaningful cost, preserving humans being surprisingly costly, AI having preferences about humans like spite for which human survival is a cost...).
To the extent that it’s possible to distinguish “intrinsic pseudokindness” from decision-theoretic considerations leading to pseudokindness, I think that decision-theoretic considerations are more important. (I don’t have a strong view on the relative importance of ECL and acausal trade, and I think these are hard to disentangle from fuzzier psychological considerations and it all tends to interact.)
> AI having scope-sensitive preferences for which not killing humans is a meaningful cost
Could you say more about what you mean? If the AI has no discount rate, leaving Earth to the humans may require kindness within a few orders of magnitude of 1/trillion. However, if the AI does have a significant discount rate, then delays could be costly to it. Still, the AI could make much more progress in building a Dyson swarm from the moon/Mercury/asteroids, whose lower gravity and lack of atmosphere would allow it to launch material very quickly. My very rough estimate indicates sparing Earth might only delay the AI by a month in taking over the universe. That could still require a lot of kindness if the AI has a very high discount rate. So maybe training should emphasize the superiority of low discount rates?
Sorry, I meant “scope-insensitive,” and really I just meant an even broader category of like “doesn’t care 10x as much about getting 10x as much stuff.” I think discount rates or any other terminal desire to move fast would count (though for options like “survive in an unpleasant environment for a while” or “freeze and revive later” the required levels of kindness may still be small).
(A month seems roughly right to me as the cost of not trashing Earth’s environment to the point of uninhabitability.)
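To see how the discount rate drives the cost of a one-month delay, here is a back-of-the-envelope sketch. It assumes simple exponential discounting and that the delay pushes all of the AI’s future value back by the same amount; both are simplifying assumptions introduced here, as are the function names and the example 1%-per-year rate.

```python
import math

def delay_cost_fraction(annual_discount_rate, delay_years):
    """Fraction of total discounted value lost if everything is delayed.

    Assumes exponential discounting: value scales by exp(-rho * t),
    where rho is the continuous rate matching the annual rate.
    """
    rho = -math.log1p(-annual_discount_rate)
    return -math.expm1(-rho * delay_years)

def breakeven_annual_rate(kindness_fraction, delay_years):
    """Annual discount rate at which the delay costs exactly kindness_fraction."""
    rho = -math.log1p(-kindness_fraction) / delay_years
    return -math.expm1(-rho)

ONE_MONTH = 1.0 / 12.0   # delay estimate discussed in the thread
ONE_TRILLIONTH = 1e-12   # the "1/trillion kindness" budget

# Hypothetical example rate of 1% per year.
cost_at_one_percent = delay_cost_fraction(0.01, ONE_MONTH)
breakeven = breakeven_annual_rate(ONE_TRILLIONTH, ONE_MONTH)

print(f"cost of a one-month delay at 1%/year: {cost_at_one_percent:.2e}")
print(f"break-even rate for 1/trillion kindness: {breakeven:.2e} per year")
```

Under these assumptions, even a 1%-per-year discount rate makes a one-month delay cost on the order of 8e-4 of total value, far more than 1/trillion, while the delay only fits inside a 1/trillion budget if the discount rate is around 1.2e-11 per year or lower.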
> I don’t think misaligned AI drives the majority of s-risk (I’m not even sure that s-risk is higher conditioned on misaligned AI), so I’m not convinced that it’s a super relevant communication consideration here.
I’m curious what does, in that case; and what proportion affects humans (and currently-existing people or future minds)? Things like spite threat commitments from a misaligned AI warring with humanity seem like a substantial source of s-risk to me.