> For example, suppose meta-execution asks the subquestion “What does the user want?”, gets a representation of their values, and then asks the subquestion “What behavior is best according to those values?” I’ve then generated incorrigible behavior by accident, after taking innocuous steps.
Estimating values then optimizing those seems (much) worse than optimizing “what the user wants.” One natural strategy for getting what the user wants can be something like “get into a good position to influence the world and then ask the user later.”
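The gap between these two policies can be made concrete with a toy simulation. Everything here, from the payoff numbers to the two-resource setup, is an illustrative assumption of mine, not something the thread commits to:

```python
import random

random.seed(0)

# Toy setup (my own numbers): the user's true value is one of two resource
# types, and the AI only has a noisy estimate of which one.
def run(policy, p_estimate_correct=0.7, trials=10_000):
    total = 0.0
    for _ in range(trials):
        true_value = random.choice(["A", "B"])
        wrong = "B" if true_value == "A" else "A"
        estimate = true_value if random.random() < p_estimate_correct else wrong
        if policy == "estimate_then_optimize":
            # Commit everything to the estimated-best resource right away.
            payoff = 1.0 if estimate == true_value else 0.0
        else:  # "position_then_ask"
            # Gather flexible resources at a small conversion cost, then ask
            # the user which resource they actually wanted.
            payoff = 0.9
        total += payoff
    return total / trials

print(run("estimate_then_optimize"))  # ~0.70: the locked-in estimate is wrong 30% of the time
print(run("position_then_ask"))       # ~0.90: staying flexible and asking later wins
```

In this toy, “estimate then optimize” only catches up once `p_estimate_correct` exceeds the 0.9 conversion payoff, which is one way of restating the claim that committing to an estimated value model is worse precisely when value estimation is hard.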
> This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user
I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).
> This is very confusing because elsewhere you say that the kind of AI you’re trying to design is just satisfying short-term preferences / instrumental values of the user, but here “good for the user” seemingly has to be interpreted as “good in the long run”.
By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.
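One literal-minded way to formalize that sentence, purely as a sketch (the interface, the function names, and the example numbers are all mine; the thread doesn’t commit to any formalization):

```python
# Score a candidate strategy by the user's judged value of the resources it
# gathers, minus any costs the user would consider serious. All names and
# numbers here are hypothetical illustrations.
def user_score(resources, costs, user_value, is_serious_cost):
    gain = sum(user_value[r] for r in resources)
    penalty = sum(c for c in costs if is_serious_cost(c))
    return gain - penalty

value = {"compute": 4, "influence": 2}
score = user_score(
    resources=["compute", "influence"],
    costs=[1, 10],                      # 10 is a cost the user takes seriously
    user_value=value,
    is_serious_cost=lambda c: c >= 5,   # made-up seriousness threshold
)
print(score)  # 6 - 10 = -4
```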
> I don’t have a very strong view about the distinction between corrigibility to the user and corrigibility to some other definition of value (e.g. a hypothetical version of the user who is more secure).
I don’t understand this statement, in part because I have little idea what “corrigibility to some other definition of value” means, and in part because I don’t know why you bring up this distinction at all, or what a “strong view” here might be about.
> By “trying to find a strategy that’s good for the user” I mean: trying to pursue the kind of resources that the user thinks are valuable, without costs that the user would consider serious, etc.
What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
I don’t understand why, if the aligned AI is depending on the user to do long-term planning (i.e., to figure out what resources are valuable to pursue today for reaching future goals), it will be competitive with unaligned AIs doing superhuman long-term planning. Is this just a (seemingly very obvious) failure mode for “strategy-stealing” that you forgot to list, or am I still misunderstanding something?
ETA: See also this earlier comment where I asked this question in a slightly different way.
> What if the user fails to realize that a certain kind of resource is valuable? (By “resources” we’re talking about things that include more than just physical resources, like control of strategic locations, useful technologies that might require long lead times to develop, reputations, etc., right?)
As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
(I think I won’t have time to engage much on this in the near future. It seems plausible that I’m skipping enough steps, or using language in an unfamiliar enough way, that this won’t make sense to readers, in which case so it goes; it’s also possible that I’m missing something.)
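Read as a decision rule, the fallback being described above might look like the following sketch. The function, its interface, and the scalar “usefulness” scores are all my own illustrative assumptions, not anything from the thread:

```python
# Strategy-stealing fallback: keep your own strategy unless the user expects
# some other agent's resource-gathering to be more useful, in which case
# switch to imitating the best such agent.
def choose_strategy(own_strategy, observed_strategies, user_prefers):
    best = own_strategy
    for other in observed_strategies:
        if user_prefers(other, best):
            best = other
    return best

# Tiny usage sketch: the user judges strategies by a made-up usefulness score.
usefulness = {"build_tools": 3, "grab_territory": 5, "do_nothing": 0}
chosen = choose_strategy(
    "build_tools",
    ["do_nothing", "grab_territory"],
    user_prefers=lambda a, b: usefulness[a] > usefulness[b],
)
print(chosen)  # grab_territory
```

The load-bearing piece is `user_prefers`: the rule only works if the user can actually recognize, at the time, that someone else’s resources are more useful.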
> As long as the user and AI appreciate the arguments we are making right now, then we shouldn’t expect it to do worse than stealing the unaligned AI’s strategy. There is all the usual ambiguity about “what the user wants,” but if the user expects that the resources other agents are gathering will be more useful than the resources its AI is gathering, then its AI would clearly do better (in the user’s view) by doing what others are doing.
There could easily be an abstract argument that other agents are gathering more useful resources, but still no way (or no corrigible way) to “do better by doing what others are doing”. For example, suppose I’m playing chess against a superhuman AI. I know the other agent is gathering more useful resources (e.g., taking up better board positions), but there’s nothing I can do about it except to turn over all of my decisions to my own AI that optimizes directly for winning the game (rather than for any instrumental or short-term preferences I might have for how to win the game).
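A minimal toy version of this objection (my own construction, and not real chess): represent a strategy as a map from positions to moves. Copying a stronger player’s strategy yields nothing unless you can occupy the positions it is defined on.

```python
# A "strategy" here is just a dict from positions to moves (a toy encoding of
# my own). The stronger player's strategy is only defined on positions I never
# occupy, so there is no concrete move for me to steal.
strong_strategy = {"winning_position": "press_advantage"}

my_position = "losing_position"
stolen_move = strong_strategy.get(my_position)

print(stolen_move)  # None: knowing they're doing better yields no usable move
```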
> I think I won’t have time to engage much on this in the near future
Ok, I tried to summarize my current thoughts on this topic as clearly as I can here, so you’ll have something concise and coherent to respond to when you get back to this.