Thanks, stating (part of) your success story this way makes it easier for me to understand and to come up with additional “ways it could fail”.
Cryptic strategies
The unaligned AI comes up with some kind of long-term strategy that the aligned AI can’t observe or can’t understand, for example because the aligned AI is trying to satisfy humans’ short-term preferences and humans can’t observe or understand the unaligned AI’s long-term strategy.
Different resources for different goals
The unaligned AI uses up resources that are useful for human goals in order to get resources that are useful for itself. The aligned AI copies this, and by the time humans figure out what their goals actually are, it’s too late. (Actually this doesn’t apply because you said “This is intended as an interim solution, i.e. you would expect to transition to using a “correct” prior before accessing most of the universe’s resources (say within 1000 years). The point of this approach is to avoiding losing influence during the interim period.” I’ll leave this here anyway to save other people time in case they think of it.)
Trying to kill everyone as a terminal goal
Under “reckless” you say “Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident.” but then you don’t list this as an independent concern. Some humans want to kill everyone (e.g. to eliminate suffering) and so they could build AIs that have this goal.
Time-inconsistent values and other human irrationalities
This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing.
This may be false if humans don’t have time-consistent values. See this and this for examples of such values. (Will have to think about how big of a deal this is, but thought I’d just flag it for now. A standard worked example of such time-inconsistent preferences is sketched after this list.)
Weird priors
From this comment: Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A. (A toy numerical sketch of this exploit appears after this list.)
Additional example of 11
This seems like an important example of 11 to state explicitly: The optimal strategy for unaligned AI to gain resources is to use lots of suffering subroutines or commit a lot of “mindcrime”. Or, the unaligned AI deliberately does this just so that you can’t copy its strategy.
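As a standard illustration of time-inconsistent values (independent of whichever examples the links above point to), consider hyperbolic discounting, where a reward of size $x$ arriving in $t$ days is valued at $x/(1+kt)$; the numbers below are made up for illustration.

$$\frac{100}{1+30}\approx 3.2 \;<\; \frac{110}{1+31}\approx 3.4 \qquad \text{(from day 0, with } k=1\text{: prefer \$110 at day 31 over \$100 at day 30)}$$

$$\frac{100}{1+0}=100 \;>\; \frac{110}{1+1}=55 \qquad \text{(from day 30: prefer the immediate \$100 over \$110 tomorrow)}$$

The agent’s ranking of the same two outcomes reverses as the dates approach, which is the sort of inconsistency being flagged here.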
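To make the “Weird priors” exploit above more concrete, here is a minimal toy sketch, assuming a made-up set of candidate statements and a crude “split resources according to the expected war outcome” bargaining rule; none of the numbers or the bargaining rule come from the discussion itself.

```python
# Toy sketch of the "Weird priors" exploit: A searches for a statement S that
# B can't check with its human and that favors A, adopts a high prior on it,
# waits for B to match that prior, then negotiates. Everything here is an
# illustrative assumption.

# A's estimates of P(A wins a future war against B | S), for statements S
# that B cannot consult its human about.
candidates = {"S1": 0.55, "S2": 0.80, "S3": 0.35}

# Step 1: A searches for the statement that most favors it.
s_star, p_a_wins_given_s = max(candidates.items(), key=lambda kv: kv[1])

# Step 2: A adopts a high prior on s_star; B, unable to consult its human,
# ends up matching it.
shared_prior_s = 0.9

# Step 3: to avoid a war, both sides accept a deal that splits resources
# according to the expected war outcome under the shared prior
# (taken to be 50/50 if s_star is false).
a_share = shared_prior_s * p_a_wins_given_s + (1 - shared_prior_s) * 0.5
print(f"A picks {s_star} and negotiates for {a_share:.2f} of the resources; "
      f"B keeps {1 - a_share:.2f}")
```

The point is only that A gets to choose S before committing to the prior, so the shared prior, and therefore the deal, ends up tilted toward outcomes that favor A.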
The unaligned AI comes up with some kind of long-term strategy that the aligned AI can’t observe or can’t understand, for example because the aligned AI is trying to satisfy humans’ short-term preferences and humans can’t observe or understand the unaligned AI’s long-term strategy.
I’m not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI. It just uses whatever procedure the unaligned AI originally used to find that strategy.
Trying to kill everyone as a terminal goal
Under “reckless” you say “Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident.” but then you don’t list this as an independent concern. Some humans want to kill everyone (e.g. to eliminate suffering) and so they could build AIs that have this goal.
I agree that people who want a barren universe have an advantage; this is similar to recklessness and fragility, but maybe worth separating.
Weird priors
From this comment: Here’s a possible way for another AI (A) to exploit your AI (B). Search for a statement S such that B can’t consult its human about S’s prior and P(A will win a future war against B | S) is high. Then adopt a high prior for S, wait for B to do the same, and come to B to negotiate a deal that greatly favors A.
I’m not sure I understand this, but it seems like my earlier response (“I’m not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI”) is relevant.
Or, the unaligned AI deliberately does this just so that you can’t copy its strategy.
I’m not imagining that the aligned AI literally observes and copies the strategy of the unaligned AI. It just uses whatever procedure the unaligned AI originally used to find that strategy.
It’s not clear to me whether this is possible. How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn’t the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI’s strategy-finding procedure?
To make a guess, are you thinking that the user tells the AI “Find a strategy that’s instrumentally useful for a variety of long-term goals, and follow that until further notice?” If so, it’s not literally the same procedure that the unaligned AI uses but you’re hoping it’s close enough?
As a matter of terminology, if you’re not thinking of literally observing and copying strategy, why not call it “strategy matching” instead of “strategy stealing” (which has a strong connotation of literal copying)?
Strategy stealing doesn’t usually involve actual stealing, just using the hypothetical strategy the second player could have used.
How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn’t the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI’s strategy-finding procedure?
This is what alignment is supposed to give you: a procedure that works just as well as the unaligned AI’s strategy (e.g. by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those). This post is mostly about whether you should expect that to work. You could also use a different set of facts that is equally useful, because you are similarly matching the meta-level strategy for discovering useful facts about how to uncover information.
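One very coarse way to picture “a procedure that works just as well as the unaligned AI’s strategy” is the sketch below: a single plan-search routine, pointed at the unaligned AI’s terminal goal in one case and at flexible influence in the other. The plan space, scoring functions, and numbers are all made up, and this is only a gloss on the claim, not something from the post.

```python
# Illustrative-only sketch: the same planning procedure (a search over plans)
# evaluated under two different objectives. All plans and scores are made up.

plans = ["build factories", "buy land", "tile the region with paperclip farms"]

def paperclips(plan):
    # The unaligned AI's terminal goal (made-up scores).
    return {"build factories": 0.6,
            "buy land": 0.5,
            "tile the region with paperclip farms": 0.9}[plan]

def flexible_influence(plan):
    # A generic measure of resources usable for many possible goals
    # (made-up scores).
    return {"build factories": 0.8,
            "buy land": 0.7,
            "tile the region with paperclip farms": 0.2}[plan]

def plan_search(score):
    # The shared strategy-finding procedure; only the evaluation differs.
    return max(plans, key=score)

unaligned_plan = plan_search(paperclips)
influence_plan = plan_search(flexible_influence)
print(unaligned_plan, "vs", influence_plan)
```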
Strategy stealing doesn’t usually involve actual stealing, just using the hypothetical strategy the second player could have used.
Oh, didn’t realize that it’s an established technical term in game theory.
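For reference, here is the textbook argument the term comes from, sketched for a symmetric game like Hex; this is standard game theory rather than anything specific to this thread.

```latex
\textbf{Claim.} In Hex, the second player has no winning strategy.

\textbf{Proof sketch.} Suppose the second player had a winning strategy $\sigma$.
The first player can make an arbitrary opening move and thereafter follow $\sigma$
as if they were the second player; whenever $\sigma$ calls for a move that is
already on the board, they make another arbitrary move instead. An extra stone
never hurts in Hex, so the ``stolen'' strategy wins for the first player as well,
contradicting the assumption that $\sigma$ wins for the second player. Note that
the first player never observes the second player's actual play; they use the
hypothetical strategy the second player \emph{could have} used. \qed
```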
by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those
What I mean is that the unaligned AI isn’t trying to “acquire influence”, but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn’t have a long-term / terminal goal, so it can’t just use “whatever procedure the unaligned AI originally used to find that strategy”, at least not literally.
What I mean is that the unaligned AI isn’t trying to “acquire influence”, but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn’t have a long-term / terminal goal, so it can’t just use “whatever procedure the unaligned AI originally used to find that strategy”, at least not literally.
Yeah, that’s supposed to be the content of the strategy-stealing assumption—that good plans for having a long-term impact can be translated into plans for acquiring flexible influence. I’m interested in looking at ways that can fail. (Alignment is the most salient to me.)
I’m still not sure I understand. Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence? If the latter, what kind of thing do you imagine it actually doing? For example is it trying to “find a strategy that’s instrumentally useful for a variety of long-term goals” as I guessed earlier? (It’s hard for me to help “look for ways that can fail” when this picture isn’t very clear.)
Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence?
The latter
It is trying to find a strategy that’s instrumentally useful for a variety of long-term goals
It’s presumably trying to find a strategy that’s good for the user, but in the worst case where it understands nothing about the user it still shouldn’t do any worse than “find a strategy that’s instrumentally useful for a variety of long-term goals.”
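One way to formalize the fallback “find a strategy that’s instrumentally useful for a variety of long-term goals”, offered only as a gloss and not as something stated in the thread, is to pick the plan whose acquired resources score well in expectation over a broad distribution of possible long-term goals:

$$\pi^{*} \in \arg\max_{\pi}\; \mathbb{E}_{g \sim p(g)}\!\left[ U_{g}\big(\mathrm{resources}(\pi)\big) \right]$$

where $p(g)$ is a broad prior over long-term goals the user might turn out to endorse and $U_{g}$ measures how useful the acquired resources are for goal $g$.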
It’s presumably trying to find a strategy that’s good for the user
This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user, but here “good for the user” seemingly has to be interpreted as “good in the long run”.
In Universality and Security Amplification you said:
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.
then in a later comment:
I think “what behavior is best according to those values” is never going to be robustly corrigible, even if you use a very good model of the user’s preferences and optimize very mildly. It’s just not a good question to be asking.
Do you see why I’m confused here? Is there a way to interpret “trying to find a strategy that’s good for the user” such that the AI is still corrigible?
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.
Estimating values then optimizing those seems (much) worse than optimizing “what the user wants.” One natural strategy for getting what the user wants can be something like “get into a good position to influence the world and then ask the user later.”
This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user
I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).
This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user, but here “good for the user” seemingly has to be interpreted as “good in the long run”.
By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.
I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).
I don’t understand this statement, in part because I have little idea what “corrigibility to some other definition of value” means, and in part because I don’t know why you bring up this distinction at all, or what a “strong view” here might be about.
By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
I don’t understand why, if the aligned AI is depending on the user to do long-term planning (i.e., to figure out what resources are valuable to pursue today for reaching future goals), it will be competitive with unaligned AIs doing superhuman long-term planning. Is this just a (seemingly very obvious) failure mode for “strategy-stealing” that you forgot to list, or am I still misunderstanding something?
ETA: See also this earlier comment where I asked this question in a slightly different way.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
(I think I won’t have time to engage much on this in the near future; it seems plausible that I am skipping enough steps, or using language in an unfamiliar enough way, that this won’t make sense to readers, in which case so it goes. It’s also possible that I’m missing something.)
As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
There could easily be an abstract argument that other agents are gathering more useful resources, but still no way (or no corrigible way) to “do better by doing what others are doing”. For example suppose I’m playing chess with a superhuman AI. I know the other agent is gathering more useful resources (e.g., taking up better board positions) but there’s nothing I can do about it except to turn over all of my decisions to my own AI that optimizes directly for winning the game (rather than for any instrumental or short-term preferences I might have for how to win the game).
I think I won’t have time to engage much on this in the near future
Ok, I tried to summarize my current thoughts on this topic as clearly as I can here, so you’ll have something concise and coherent to respond to when you get back to this.