Ah, ok, I may have been imagining something different by “common sense” than you are—something more focused on the human-specific parts.
Maybe this claim gets more at the crux: the parts of “common sense” which are sufficient for handling embeddedness issues with human values are not instrumentally convergent; the parts of “common sense” which are instrumentally convergent are not sufficient for human values.
The cat on the keyboard seems like a decent example here (though somewhat oversimplified). If the keyboard suddenly starts emitting random symbols, then it seems like common sense to ignore it—after all, those symbols obviously aren’t coming from a human. On the other hand, if the AI’s objective is explicitly pointing to the keyboard, then that common sense won’t do any good—it doesn’t have any reason to care about the human’s input more than random input a priori, common sense or not. Obviously there are simple ways of handling this particular problem, but it’s not something the AI would learn unless it was pointing to the human to begin with.
Hmm, this seems to be less about whether or not you have common sense, and more about whether the AI system is motivated to use its common sense in interpreting instructions / goals.
I think if you have an AI system that is maximizing an explicit objective, e.g. “maximize the numbers input from this keyboard”, then the AI will have common sense, but (almost tautologically) won’t use it to interpret the input correctly. (See also Failed Utopia.)
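As a toy sketch of that point (every function and number below is hypothetical, made up purely for illustration): the agent’s world model can tell perfectly well who is typing, but its explicit objective only references the keyboard channel, so that knowledge is never consulted.

```python
import random

random.seed(0)

def keyboard_number(source):
    """Whatever number the keyboard emits, regardless of who pressed the keys."""
    return 7 if source == "human" else random.randint(0, 9)

def world_model_says_human(source):
    """The agent's 'common sense': it knows exactly who is typing."""
    return source == "human"

def objective(source):
    # "Maximize the numbers input from this keyboard": nothing here calls
    # the world model, so the common sense exists but (almost tautologically)
    # goes unused in interpreting the input.
    return keyboard_number(source)
```

The point of the sketch is that adding more common sense to `world_model_says_human` changes nothing, because the objective never points at it.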
The hope is to train an AI system that doesn’t work like that, in the same way that humans don’t work like that. (In fact, I could see AI systems being trained this way by default; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)
Let me make sure I understand what you’re picturing as an example. Rather than giving an AI an explicit objective, we train it to follow instructions from a human (presumably using something RL-ish?), and the idea is that it will learn something like human common sense in order to better follow instructions. Is that a prototypical case of what you’re imagining? If so, what criteria do you imagine using for training? Maximizing a human approval score? Mimicking a human/predicting what a human would do and then doing that? Some kind of training procedure which somehow avoids optimizing anything at all?
> Is that a prototypical case of what you’re imagining?
Yes.
> Maximizing a human approval score?
Sure, that seems reasonable. Note that this does not mean that the agent ends up taking whichever actions maximize the number entered into a keyboard; it instead creates a policy that is consistent with the constraints “when asked to follow <instruction i>, I should choose action <most approved action i>”, for instructions and actions it is trained on. It’s plausible to me that the most “natural” policy that satisfies these constraints is one which predicts what a real human would think of the chosen action, and then chooses the action that does best according to that prediction.
(In practice you’d want to add other things like e.g. interpretability and adversarial training.)
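Here is a tabular toy sketch of that constraint (the instructions, actions, and approval scores are all made up for illustration; a real system would learn a policy from RL-ish training, not a lookup table):

```python
instructions = ["make coffee", "water the plants"]
actions = ["make coffee", "water the plants", "do nothing"]

# Human approval scores collected during training, one per
# (instruction, action) pair. Hypothetical numbers.
approval = {
    ("make coffee", "make coffee"): 1.0,
    ("make coffee", "water the plants"): 0.1,
    ("make coffee", "do nothing"): 0.2,
    ("water the plants", "make coffee"): 0.1,
    ("water the plants", "water the plants"): 1.0,
    ("water the plants", "do nothing"): 0.2,
}

def policy(instruction):
    """Satisfies the constraint: when asked to follow instruction i,
    choose the most-approved action for i."""
    return max(actions, key=lambda a: approval[(instruction, a)])
```

Note that nothing in `policy` maximizes the raw number entered into a keyboard; it only has to be consistent with the approval constraints on the instructions it was trained on.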
> It’s plausible to me that the most “natural” policy that satisfies these constraints is one which predicts what a real human would think of the chosen action...
I’d expect that’s going to depend pretty heavily on how we’re quantifying “most natural”, which brings us right back to the central issue.
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected—and that will produce basically the same problems as an actual human at a keyboard. The final policy won’t point to human values any more robustly than the data collection process did—if the data was generated by a human typing at a keyboard, then the most-predictive policy will predict what a human would type at a keyboard, not what a human “actually wants”. Garbage in, garbage out, etc.
More pithily: if a problem can’t be solved by a human typing something into a keyboard, then it also won’t be solved by simulating/predicting what the human would type into the keyboard.
It could be that there’s some viable criterion of “natural” other than just maximizing predictive power, but predictive power alone won’t circumvent the embeddedness problems.
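The “garbage in, garbage out” point can be sketched as a toy estimation problem (all numbers hypothetical): fit a predictor to data typed at a keyboard that a cat occasionally walks on, and the loss-minimizing predictor tracks what was typed, cat noise included, rather than what the human intended.

```python
import random

random.seed(0)

INTENDED = 7  # the number the human actually wants to enter

def typed():
    # 20% of entries come from the cat: uniform over digits 0-9.
    return INTENDED if random.random() > 0.2 else random.randint(0, 9)

data = [typed() for _ in range(10_000)]

# Under squared error, the best constant predictor of the data is its mean,
# which is pulled away from INTENDED toward the cat's noise (roughly
# 0.8 * 7 + 0.2 * 4.5 = 6.5).
best_predictor = sum(data) / len(data)
assert abs(best_predictor - INTENDED) > 0.2
```

No amount of extra predictive power fixes this: a perfect model of the data-generating process reproduces the cat's contribution faithfully.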
> Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected—and that will produce basically the same problems as an actual human at a keyboard. [...] the most-predictive policy will predict what a human would type at a keyboard, not what a human “actually wants”.
Agreed. I don’t think we will get that policy, because it’s very complex. (It’s much easier / cheaper to predict what the human wants than to run a detailed simulation of the room.)
> I’d expect that’s going to depend pretty heavily on how we’re quantifying “most natural”, which brings us right back to the central issue.
I’m making an empirical prediction, so I’m not the one quantifying “most natural”; reality is.
Tbc, I’m not saying that this is a good on-paper solution to AI safety; it doesn’t seem like we could know in advance that this would work. I’m saying that it may turn out that as we train more and more powerful systems, we see evidence that the picture I painted is basically right; in that world it could be enough to do some basic instruction-following.
I’m also not saying that this is robust to scaling up arbitrarily far; as you said, the literal most predictive policy doesn’t work.
Cool, I agree with all of that. Thanks for taking the time to talk through this.