Strategy stealing doesn’t usually involve actual stealing, just using the hypothetical strategy the second player could have used.
How? The unaligned AI is presumably applying some kind of planning algorithm to its long-term/terminal goal to find its strategy, but in your scenario isn’t the aligned/corrigible AI just following the short-term/instrumental goals of its human users? How is it able to use the unaligned AI’s strategy-finding procedure?
This is what alignment is supposed to give you—a procedure that works just as well as the unaligned AI’s strategy, e.g. by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those. (This post is mostly about whether you should expect that to work. You could also use a different but equally useful set of facts, because you are similarly matching the meta-level strategy for discovering useful facts about how to uncover information.)
Strategy stealing doesn’t usually involve actual stealing, just using the hypothetical strategy the second player could have used.
Oh, didn’t realize that it’s an established technical term in game theory.
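(For the record, the classic argument goes roughly as follows—the game below is a generic illustration of the game-theoretic usage, not anything from this exchange.)

```latex
% Classic strategy-stealing argument (sketch); the game here is generic and illustrative.
\begin{quote}
\textbf{Setting.} A finite, symmetric, perfect-information game in which an
extra move can never hurt the player who made it (e.g.\ Hex or tic-tac-toe).

\textbf{Claim.} The second player has no winning strategy, so the first
player can guarantee at least a draw.

\textbf{Argument.} Suppose the second player had a winning strategy $\sigma$.
The first player can ``steal'' it: make an arbitrary opening move $m_0$,
then ignore $m_0$ and respond to the opponent exactly as $\sigma$ prescribes
for the second player. If $\sigma$ ever calls for the move $m_0$ (already
made), play another arbitrary move and treat it as the new extra move.
Since extra moves never hurt, this gives the first player a winning strategy
as well---contradicting the assumption that $\sigma$ wins for the second
player. Nothing is literally taken from the opponent; the first player only
uses the strategy the second player \emph{could have} used.
\end{quote}
```

Roughly, the analogy in the post is that the aligned AI plays the role of the first player, reusing (a translation of) whatever plan an unaligned AI could have used.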
by updating on all the same logical facts about how to acquire influence that the unaligned AI might discover and then using those
What I mean is that the unaligned AI isn’t trying to “acquire influence”, but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn’t have a long-term / terminal goal, so it can’t just “use whatever procedure the unaligned AI originally used to find that strategy”, at least not literally.
What I mean is that the unaligned AI isn’t trying to “acquire influence”, but rather trying to accomplish a specific long-term / terminal goal. The aligned AI doesn’t have a long-term / terminal goal, so it can’t just “use whatever procedure the unaligned AI originally used to find that strategy”, at least not literally.
Yeah, that’s supposed to be the content of the strategy-stealing assumption—that good plans for having a long-term impact can be translated into plans for acquiring flexible influence. I’m interested in looking at ways that can fail. (Alignment is the most salient to me.)
I’m still not sure I understand. Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence? If the latter, what kind of thing do you imagine it actually doing? For example is it trying to “find a strategy that’s instrumentally useful for a variety of long-term goals” as I guessed earlier? (It’s hard for me to help “look for ways that can fail” when this picture isn’t very clear.)
Is the aligned AI literally applying a planning algorithm to the same long-term goal as the unaligned AI, and then translating that plan into a plan for acquiring flexible influence, or is it just generally trying to come up with a plan to acquire flexible influence?
The latter
It is trying to find a strategy that’s instrumentally useful for a variety of long-term goals
It’s presumably trying to find a strategy that’s good for the user, but in the worst case where it understands nothing about the user it still shouldn’t do any worse than “find a strategy that’s instrumentally useful for a variety of long-term goals.”
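(As a concrete, if cartoonish, illustration of “instrumentally useful for a variety of long-term goals”: the toy model below compares committing now to a plan specialized for one guessed goal against acquiring flexible resources now and specializing once the goal is known. The goals, payoffs, and names are made-up assumptions for this sketch, not anything from the post or this discussion.)

```python
"""Toy model (illustrative only): an agent that doesn't yet know which
long-term goal it will be asked to serve can still do well by acquiring
flexible influence now and specializing later.  All goals, payoffs, and
names here are made-up assumptions for this sketch."""

import random

GOALS = ["goal_A", "goal_B", "goal_C"]  # possible long-term goals of the user

def specialized_payoff(plan_goal: str, true_goal: str) -> float:
    """Payoff of a plan specialized for `plan_goal` when the true goal is `true_goal`.
    Assumption: specializing for the wrong goal does much worse."""
    return 10.0 if plan_goal == true_goal else 1.0

# Assumption: staying flexible costs a little relative to a perfectly aimed plan.
FLEXIBILITY_DISCOUNT = 0.9

def flexible_payoff(true_goal: str) -> float:
    """Acquire generic resources now, then specialize after the goal is revealed."""
    return FLEXIBILITY_DISCOUNT * specialized_payoff(true_goal, true_goal)

def expected(strategy, n_trials: int = 100_000) -> float:
    """Monte Carlo estimate of a strategy's payoff when the true goal is unknown in advance."""
    total = 0.0
    for _ in range(n_trials):
        true_goal = random.choice(GOALS)  # goal revealed only after acting
        total += strategy(true_goal)
    return total / n_trials

if __name__ == "__main__":
    commit_to_guess = lambda g: specialized_payoff("goal_A", g)  # guess a goal and commit now
    print("commit to a guessed goal:     ", round(expected(commit_to_guess), 2))  # ~4.0
    print("flexible influence, ask later:", round(expected(flexible_payoff), 2))  # ~9.0
```

Under these arbitrary numbers, deferring the choice of goal while accumulating flexible influence beats committing to a guess whenever the cost of staying flexible is small relative to the cost of specializing for the wrong goal—which is the spirit of the strategy-stealing assumption.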
It’s presumably trying to find a strategy that’s good for the user
This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user, but here “good for the user” seemingly has to be interpreted as “good in the long run”.
In Universality and Security Amplification you said:
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.
then in a later comment:
I think “what behavior is best according to those values” is never going to be robustly corrigible, even if you use a very good model of the user’s preferences and optimize very mildly. It’s just not a good question to be asking.
Do you see why I’m confused here? Is there a way to interpret “trying to find a strategy that’s good for the user” such that the AI is still corrigible?
For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.
Estimating values then optimizing those seems (much) worse than optimizing “what the user wants.” One natural strategy for getting what the user wants can be something like “get into a good position to influence the world and then ask the user later.”
This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user
I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).
This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user, but here “good for the user” seemingly has to be interpreted as “good in the long run”.
By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.
I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).
I don’t understand this statement, in part because I have little idea what “corrigibility to some other definition of value” means, and in part because I don’t know why you bring up this distinction at all, or what a “strong view” here might be about.
By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
I don’t understand why, if the aligned AI is depending on the user to do long-term planning (i.e., to figure out what resources are valuable to pursue today for reaching future goals), it will be competitive with unaligned AIs doing superhuman long-term planning. Is this just a (seemingly very obvious) failure mode for “strategy-stealing” that you forgot to list, or am I still misunderstanding something?
ETA: See also this earlier comment where I asked this question in a slightly different way.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
(I think I won’t have time to engage much on this in the near future; it seems plausible that I am skipping enough steps or using language in an unfamiliar enough way that this won’t make sense to readers, in which case so it goes; it’s also possible that I’m missing something.)
As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
There could easily be an abstract argument that other agents are gathering more useful resources, but still no way (or no corrigible way) to “do better by doing what others are doing”. For example suppose I’m playing chess with a superhuman AI. I know the other agent is gathering more useful resources (e.g., taking up better board positions) but there’s nothing I can do about it except to turn over all of my decisions to my own AI that optimizes directly for winning the game (rather than for any instrumental or short-term preferences I might have for how to win the game).
I think I won’t have time to engage much on this in the near future
Ok, I tried to summarize my current thoughts on this topic as clearly as I can here, so you’ll have something concise and coherent to respond to when you get back to this.