To be clear, my original claim was about hypothetical scenarios where the failure occurs because the AI didn’t know human values, rather than cases where the AI knows what the human would want but a failure still occurs.
I’m not sure I understand the distinction that you’re drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don’t know enough about human values, or maybe where humans themselves don’t know enough about human values.) What are some examples of what your claim was about?
I do still think they are not as important as intent alignment.
As in, the total expected value lost through such scenarios isn’t as large as the expected value lost through the risk of failing to solve intent alignment? Can you give some ballpark figures of how you see each side of this inequality?
Mostly I’d hope that AI can tell what philosophy is optimized for persuasion
How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?
or at least is capable of presenting counterarguments persuasively as well.
You mean every time you hear a philosophical argument, you ask your AI to produce some counterarguments optimized for persuasion? If so, won’t your friends be afraid to send you any arguments they think of, for fear of your AI superhumanly persuading you to the opposite conclusion?
And I don’t expect a large number of people to explicitly try to lock in their values.
A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn’t they ask their AI for help with this? Or do you imagine them asking for something like “more faith”, but AIs understand human values well enough to not interpret that as “lock in values”?
It seems odd to me that it’s sufficiently competent to reason about simulations well enough that an acausal threat can actually be made, but then not competent at reasoning about exotic philosophical cases, and I don’t particularly expect this to happen.
The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality. For example, how much should I care about my copies in the simulations or my subjective future experience, versus the value that would be lost in the base reality if I were to give in to the simulators’ demands? Should I make a counterthreat? Are there any thoughts I or my AI should avoid having, or computations we should avoid doing?
I don’t expect AIs to have clean crisp utility functions of the form “maximize paperclips” (at least initially).
I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else, maybe forcing them to follow.
I expect this to be way less work than the complicated plans that the AI is enacting, so it isn’t a huge competitiveness hit.
I don’t think it’s so much that the coordination involving humans is a lot of work, but rather that we don’t know how to do it in a way that doesn’t cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I’d need to see more details before I update.)
I’m not sure I understand the distinction that you’re drawing here. (It seems like my scenarios could also be interpreted as failures where AIs don’t know enough about human values, or maybe where humans themselves don’t know enough about human values.) What are some examples of what your claim was about?
Examples:
Your AI thinks it’s acceptable to inject you with heroin, because it predicts you will then want more heroin.
Your AI is uncertain whether you’d prefer to explore space or stay on Earth. It randomly guesses that you want to stay on Earth and takes irreversible actions on your behalf that force you to stay on Earth.
In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.
As in, the total expected value lost through such scenarios isn’t as large as the expected value lost through the risk of failing to solve intent alignment?
No, the expected value of marginal effort aimed at solving these problems isn’t as large as the expected value of marginal effort on intent alignment.
(I don’t like talking about “expected value lost” because it’s not always clear what does and doesn’t count as part of that. For example I think it’s nearly inevitable that different people will have different goals and so the future will not be exactly as any one of them desired; should I say that there’s a lot of expected value lost from “coordination problems” for that reason? It seems a bit weird to say that if you think there isn’t a way to regain that “expected value”.)
Can you give some ballpark figures of how you see each side of this inequality?
Uh, idk. It’s not something I have numbers on. But I suppose I can try and make up some very fake numbers for, say, AI persuasion. (Before I actually do the exercise, let me note that I could imagine the exercise coming out with numbers that favor persuasion over intent alignment; this probably won’t change my mind and would instead make me distrust the numbers, but I’ll publish them anyway.)
To avert an existentially bad outcome from AI persuasion, I’d imagine first figuring out some solutions, then figuring out how to implement them, and then getting the appropriate people to implement them; it seems like you need all of these steps in order to make any difference. (Could be technical or governance though.) It feels especially hard to do so at the moment, given how little we know about future AI capabilities and when each potential capability arrives. Maybe:
Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between the probabilities of existential catastrophe caused by the given problem, assuming no intervention from longtermists).
A piece of alignment work now is 10x more likely to target the right problem than a similar piece of work for persuasion.
A piece of alignment work is currently 10x harder to produce than a similar piece of work for persuasion.
So I guess overall I’m at ~100x, very very roughly?
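If it helps, here is the arithmetic behind that figure laid out explicitly, using the very fake multipliers above (the multiplicative combination is just the simplest way to stack these considerations, not a considered model):

```python
# Very fake ratios from the three points above (alignment : persuasion).
more_likely_to_be_existential = 100       # alignment is likelier to be an existential problem at all
more_likely_to_target_right_problem = 10  # alignment work today is likelier to hit the right problem
harder_to_produce = 10                    # but a piece of alignment work is harder to produce

# Relative marginal value of effort ~ (problem matters) * (work targets it) / (cost of the work).
relative_marginal_value = (
    more_likely_to_be_existential
    * more_likely_to_target_right_problem
    / harder_to_produce
)
print(relative_marginal_value)  # 100.0, i.e. the "~100x, very very roughly" above
```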
How? How would you train an AI to distinguish between philosophy optimized for persuasion, and correct or well-intentioned philosophy that just happens to be very persuasive?
Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments. Or it says that there are other counterfactual letters you could have received, such that after you read them you’d be convinced of the opposite position, and then it asks whether you still want to read the letter.
If your AI doesn’t know about the counterarguments, and the letter is persuasive even to the AI, then it seems like you’re hosed, but I’m not sure why to expect that.
A lot of people are playing status games where faith/loyalty to their cause/ideology gains them a lot of status points. Why wouldn’t they ask their AI for help with this? Or do you imagine them asking for something like “more faith”, but AIs understand human values well enough to not interpret that as “lock in values”?
I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.
I’m not entirely sure why you do expect this. Maybe you’re viewing them as consciously optimizing for winning status games + a claim that the optimal policy is to lock in your values? But what if the values rewarded by the status games change (as they already seem to, e.g. moving from atheism to anti-racism)? In that case it seems like you don’t want to lock in your values, to better play the status game in the future.
The former seems to just require that the AI is good at reasoning about mathematical/empirical matters (e.g., are there many simulations of me actually being run in some universe or set of universes) which I think AIs will be good at by default, whereas dealing with the threats seems to require reasoning about hard philosophical problems like decision theory and morality.
If you’ve determined a set of universes to care about, then shouldn’t at least decision theory reduce to mathematical / empirical matters about which decision procedure gets the most value across the set of universes? I do agree that moral questions are still not mathematical / empirical, but I don’t find that all that persuasive. I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.
I’m guessing you feel better about mathematical / empirical reasoning because there’s a ground truth that says when that reasoning is done well. I don’t particularly find the existence of a ground truth to be all that big a deal—it probably helps but doesn’t seem tremendously important.
I expect that AIs (or humans) who are less cautious or who think their values can be easily expressed as utility functions will do this first, and thereby gain an advantage over everyone else, maybe forcing them to follow.
Fair enough—I agree this is plausible (though only plausible; it doesn’t seem like we’ve sacrificed everything to Moloch yet).
I don’t think it’s so much that the coordination involving humans is a lot of work, but rather that we don’t know how to do it in a way that doesn’t cause a lot of waste, similar to a democratically elected administration implementing a bunch of policies only to be reversed by the next administration that takes power, or lawmakers pursuing pork barrel projects that collectively make almost everyone worse off, or being unable to establish and implement easy policies (see COVID again). (You may well have something in mind that works well in the context of intent aligned AI, but I have a prior that says this class of problems is very difficult in general so I’d need to see more details before I update.)
If we’re imagining coordination between a billion AIs, that seems less obviously doable. I still think that, if you’ve solved intent alignment, it seems much easier. Democratic elections are not just about coordination—they’re also about alignment, since politicians have to optimize for being re-elected. It seems like you could do a lot better if you didn’t have to worry about the alignment part.
In contrast, something like a threat doesn’t count, because you know that the outcome if the threat is executed is not something you want; the problem comes because you don’t know how to act in a way that both disincentivizes threats and also doesn’t lead to (too many) threats being enforced. In particular, the problem is not that you don’t know which outcomes are bad.
I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
Alignment is 100x more likely to be an existentially risky problem at all (think of this as the ratio between the probabilities of existential catastrophe caused by the given problem, assuming no intervention from longtermists).
This seems really extreme, if I’m not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
Putting on my “what would I do” hat, I’m imagining that the AI doesn’t know that it was specifically optimized to be persuasive, but it does know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.
Given that humans are liable to be persuaded by bad counterarguments too, I’d be concerned that the AI will always “know that there are other persuasive counterarguments that aren’t being presented, and so it says that it looks one-sided and you might want to look at these other counterarguments.” Since it’s not safe to actually look at the counterarguments found by your own AI, it’s not really helping at all. (Or it makes things worse if the user isn’t very cautious and does look at their AI’s counterarguments and gets persuaded by them.)
I totally expect them to ask AI for help with such games. I don’t expect (most of) them to lock in their values such that they can’t change their mind.
I think most people don’t think very long term and aren’t very rational. They’ll see some people within their group do AI-enabled value lock-in, get a lot of status reward for it, and emulate that behavior in order to not fall behind and become low status within the group. (This might be a gradual process resembling “purity spirals” of the past, i.e., people ask AI to do more and more things that have the effect of locking in their values, or a sudden wave of explicit value lock-ins.)
I expect AIs will be able to do the sort of philosophical reasoning that we do, and the question of whether we should care about simulations seems way way easier than the question about which simulations of me are being run, by whom, and what they want.
This seems plausible to me, but I don’t see how one can have enough confidence in this view that one isn’t very worried about the opposite being true and constituting a significant x-risk.
I broadly agree with the things you’re saying; I think it mostly comes down to the actual numbers we’d assign.
This seems really extreme, if I’m not misunderstanding you. (My own number is like 1x-5x.) Assuming your intent alignment risk is 10%, your AI persuasion risk is only 1/1000?
Yeah, that’s about right. I’d note that it isn’t totally clear what the absolute risk number is meant to capture—one operationalization is that it is P(existential catastrophe occurs, and if we had solved AI persuasion but the world was otherwise exactly the same, then no existential catastrophe occurs) -- I realize I didn’t say exactly this above, but that’s the one that is mutually exclusive across risks, and the one that determines the expected value of solving the problem.
To justify the absolute number of 1/1000, I’d note that:
1. The case seems pretty speculative + conjunctive—you’d need people to choose to use AI to be very persuasive (instead of, idk, retiring to live in luxury in small insular subcommunities), you’d need the AI to be better at persuasion than at defending against persuasion (or for people to choose not to defend), and you’d need this to be so bad that it leads to an existential catastrophe.
2. I feel like if I talked to lots of people for the amount of time I’ve talked with you / others about AI persuasion (i.e. not very much, but enough to convey a basic idea), I’d end up with 10-300 other risks of similar magnitude and plausibility. Under the operationalization I gave above, these probabilities would be mutually exclusive. So that places an upper bound of somewhere between 1/300 and 1/10 on any given problem.
3. I don’t expect this bound to be tight. For example, if it were tight, that would imply that existential catastrophe is guaranteed. But more importantly, there are lots of worlds in which existential catastrophe is overdetermined because society is terrible at coordinating. If you condition on “existential catastrophe” and “AI persuasion was a big problem”, I update that we were really bad at coordination, and so I also think that there would be lots of other problems such that solving persuasion wouldn’t prevent the existential catastrophe. (Whereas alignment feels much more like a direct technical challenge—while there certainly is an update against societal coordination if we get an existential catastrophe with a failure of alignment, the update seems a lot smaller, and so I’m more optimistic that solving alignment means that the existential catastrophe doesn’t happen at all.)
The 100x differential between alignment and persuasion comes mostly because points (2) and (3) above don’t apply to alignment, while point (1) applies only in part—given my state of knowledge, the case for alignment failure seems much less speculative (though obviously still speculative), though it is still quite conjunctive.
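To make the bound in point (2) concrete, here is a minimal sketch using only numbers that already appear in this exchange (the 10% alignment risk and the 10-300 count of comparable risks); nothing here is a new estimate:

```python
# If the ~10-300 persuasion-like risks are mutually exclusive under the
# operationalization above, their probabilities sum to at most 1, so each
# similar-sized risk is bounded by roughly 1/N.
for n_risks in (10, 300):
    print(f"{n_risks} similar risks -> each at most ~1/{n_risks} = {1 / n_risks:.4f}")

# The made-up numbers from earlier sit inside that range:
intent_alignment_risk = 0.10                      # "about right", per the exchange above
ai_persuasion_risk = intent_alignment_risk / 100  # the 100x differential gives 1/1000
print(ai_persuasion_risk)                         # 0.001, below even the 1/300 end of the bound
```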
I see, but I think at least part of the problem with threats is that I’m not sure what I care about, which greatly increases my “attack surface”. For example, if I knew that negative utilitarianism is definitely wrong, then threats to torture some large number of simulated people wouldn’t be effective on me (e.g., under total utilitarianism, I could use the resources demanded by the attacker to create more than enough happy people to counterbalance whatever they threaten to do).
That’s a fair point; I agree this is a way in which full knowledge of human values can help avoid potentially significant risks in a way that intent alignment doesn’t.