By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kinds of resources that the user thinks are valuable, without incurring costs that the user would consider serious, etc.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
I don’t understand why, if the aligned AI depends on the user to do long-term planning (i.e., to figure out what resources are valuable to pursue today in order to reach future goals), it will be competitive with unaligned AIs doing superhuman long-term planning. Is this just a (seemingly very obvious) failure mode for “strategy-stealing” that you forgot to list, or am I still misunderstanding something?
ETA: See also this earlier comment where I asked this question in a slightly different way.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
As long as the user and AI appreciate the arguments we are making right now, we shouldn’t expect the aligned AI to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources their AI is gathering, then their AI would clearly do better (in the user’s view) by doing what others are doing.
(I think I won’t have time to engage much on this in the near future. It seems plausible that I am skipping enough steps, or using language in an unfamiliar enough way, that this won’t make sense to readers; in that case, so it goes. It’s also possible that I’m missing something.)
As long as the user and AI appreciate the arguments we are making right now, we shouldn’t expect the aligned AI to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources their AI is gathering, then their AI would clearly do better (in the user’s view) by doing what others are doing.
There could easily be an abstract argument that other agents are gathering more useful resources, but still no way (or no corrigible way) to “do better by doing what others are doing”. For example, suppose I’m playing chess against a superhuman AI. I know the other agent is gathering more useful resources (e.g., taking up better board positions), but there’s nothing I can do about it except to turn over all of my decisions to my own AI that optimizes directly for winning the game (rather than for any instrumental or short-term preferences I might have for how to win the game).
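To make the point concrete with a toy model (a sketch of my own, not anything from the strategy-stealing post; the game and the player names are purely illustrative): in the Python snippet below, a player that only looks one move ahead loses a simple take-away game to a full-horizon minimax player, even though nothing stops it from noticing, move by move, that the opponent keeps steering into better positions. Its only remedy is to hand its moves over to an optimizer that searches the whole game tree, which is the analogue of turning my decisions over to an AI that optimizes directly for winning.

```python
# Toy illustration (hypothetical, not from the thread): a short-horizon player
# vs. a full-horizon minimax player in a take-away game. 21 sticks; each turn
# a player removes 1-3 sticks; whoever takes the last stick wins.

import functools

TAKE_OPTIONS = (1, 2, 3)

@functools.lru_cache(maxsize=None)
def best_move(sticks):
    """Full-horizon minimax: return (can_win, move) for the player to move."""
    for take in TAKE_OPTIONS:
        if take == sticks:
            return True, take          # taking the last stick wins immediately
        if take < sticks and not best_move(sticks - take)[0]:
            return True, take          # leave the opponent a losing position
    return False, 1                    # no winning move; take 1 and hope

def short_horizon_move(sticks):
    """One-step lookahead: take the immediate win if available, else grab the max."""
    return min(3, sticks)

def play(sticks=21):
    player = "long-horizon"            # the full-minimax player moves first
    while sticks > 0:
        if player == "long-horizon":
            _, take = best_move(sticks)
        else:
            take = short_horizon_move(sticks)
        sticks -= take
        if sticks == 0:
            return player              # whoever takes the last stick wins
        player = "short-horizon" if player == "long-horizon" else "long-horizon"

print("winner:", play())               # prints "winner: long-horizon"
```

The short-horizon player here could even compute, after each move, that the remaining stick count is a losing position for it; that knowledge changes nothing unless it also adopts the deeper search.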
I think I won’t have time to engage much on this in the near future
Ok, I tried to summarize my current thoughts on this topic as clearly as I can here, so you’ll have something concise and coherent to respond to when you get back to this.