Planned summary for the Alignment Newsletter:
<@Embedded agency@>(@Embedded Agents@) is not just a problem for AI systems: humans are embedded agents too; many problems in understanding human values stem from this fact. For example, humans don’t have a well-defined output channel: we can’t say “anything that comes from this keyboard is direct output from the human”, because the AI could seize control of the keyboard and wirehead, or a cat could walk over the keyboard, etc. Similarly, humans can “self-modify”, e.g. by drinking, which often modifies their “values”: what does that imply for value learning? Based on these and other examples, the post concludes that “a better understanding of embedded agents in general will lead to substantial insights about the nature of human values”.
Planned opinion:
I certainly agree that many problems with value learning stem from embedded agency issues with humans, and any <@formal account@>(@Why we need a *theory* of human values@) of this will benefit from general progress in understanding embeddedness. Unlike many others, though, I do not think we need a formal account of human values: I expect that a “common-sense” understanding will suffice, including for the embeddedness problems detailed in this post.
One (possibly minor?) point: this isn’t just about value learning; it’s the more general problem of pointing to values. For instance, a system with a human in the loop may not need to learn values; it could rely on the human to provide value judgements. On the other hand, the human still needs to point to their own values in a manner usable/interpretable by the rest of the system (possibly with the human doing the “interpretation”, as in e.g. tool AI). Also, the system still needs to point to the human somehow—cats walking on keyboards are still a problem.
Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I’d be interested to read that. (Or if someone else has written up views similar to your own, that works too.)
One (possibly minor?) point: this isn’t just about value learning; it’s the more general problem of pointing to values.
Makes sense, I changed “value learning” to “figuring out what to optimize”.
Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I’d be interested to read that.
Hmm, I was going to say Chapter 3 of the Value Learning sequence, but looking at it again it doesn’t really talk about this. Maybe the post on Following human norms gives some idea of the flavor of what I mean, but it doesn’t explicitly talk about it. Perhaps I should write about this in the future.
Here’s a brief version:
We’ll build ML systems with common sense, because common sense is necessary for tasks of interest; common sense already deals with most (all?) of the human embeddedness problems. There are still two remaining problems:
Ensuring the AI uses its common sense when interpreting our goals / instructions. We’ll probably figure this out in the future; it seems likely that “give instructions in natural language” automatically works (this is the case with human assistants for example).
Ensuring the AI is not trying to deceive us. This seems mostly-independent of human embeddedness. You can certainly construct examples where human embeddedness makes it hard to tell whether something is deceptive or not, but I think in practice “is this deceptive” is a common sense natural category that we can try to detect. (You may not be able to prove theorems, since it relies on common sense understanding; but you could be able to detect deception in any case that actually arises.)
Thanks, that makes sense.
FWIW, my response would be something like: assuming that common-sense reasoning is sufficient, we’ll probably still need a better understanding of embeddedness in order to actually build common-sense reasoning into an AI. When we say “common sense can solve these problems”, it means humans know how to solve the problems, but that doesn’t mean we know how to translate the human understanding into something an AI can use. I do agree that humans already have a good intuition for these problems, but we still don’t know how to automate that intuition.
I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not “common sense” is a natural category that ML-style methods could figure out. I do think it’s a natural category in some sense, but I think we still need a theoretical breakthrough before we’ll be able to point a system at it—and I don’t think systems will acquire human-compatible common sense by default as an instrumentally convergent tool.
I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not “common sense” is a natural category that ML-style methods could figure out.
To give some flavor of why I think ML could figure it out:
I don’t think “common sense” itself is a natural category; it is instead more like a bundle of other things that are natural, e.g. pragmatics. It doesn’t seem like “common sense” is innate to humans; we seem to learn “common sense” somehow (toddlers are often too literal). I don’t see an obvious reason why an ML algorithm shouldn’t be able to do the same thing.
In addition, “common sense” type rules are often very useful for prediction, e.g. if you hear “they gave me a million packets of hot sauce”, and then you want to predict how many packets of hot sauce there are in the bag, you’re going to do better if you understand common sense. So common sense is instrumentally useful for prediction (and probably any other objective you care to name that we might use to train an AI system).
That said, I don’t think it’s a crux for me—even if I believed that current ML systems wouldn’t be able to figure “common sense” out, my main update would be that current ML systems wouldn’t lead to AGI / transformative AI, since I expect most tasks require common sense. Perhaps the crux is “transformative AI will necessarily have figured out most aspects of ‘common sense’”.
Ah, ok, I may have been imagining something different by “common sense” than you are—something more focused on the human-specific parts.
Maybe this claim gets more at the crux: the parts of “common sense” which are sufficient for handling embeddedness issues with human values are not instrumentally convergent; the parts of “common sense” which are instrumentally convergent are not sufficient for human values.
The cat on the keyboard seems like a decent example here (though somewhat oversimplified). If the keyboard suddenly starts emitting random symbols, then it seems like common sense to ignore it—after all, those symbols obviously aren’t coming from a human. On the other hand, if the AI’s objective is explicitly pointing to the keyboard, then that common sense won’t do any good—it doesn’t have any reason to care about the human’s input more than random input a priori, common sense or not. Obviously there are simple ways of handling this particular problem, but it’s not something the AI would learn unless it was pointing to the human to begin with.
Hmm, this seems to be less about whether or not you have common sense, and more about whether the AI system is motivated to use its common sense in interpreting instructions / goals.
I think if you have an AI system that is maximizing an explicit objective, e.g. “maximize the numbers input from this keyboard”, then the AI will have common sense, but (almost tautologically) won’t use it to interpret the input correctly. (See also Failed Utopia.)
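To make that concrete, here is a minimal toy sketch of an objective that points directly at the keyboard (all names and numbers are invented for illustration; this isn’t any real system). Nothing in it gives the agent a reason to treat cat-generated keystrokes differently from human ones:

```python
# Toy sketch: an objective defined directly over the keyboard's output.
# "Who produced the keystrokes" is simply not an input to the objective,
# so common sense about cats vs. humans cannot change what gets rewarded.

def reward(keystroke_numbers: list[int]) -> float:
    # Literal objective: maximize the numbers input from this keyboard.
    return float(sum(keystroke_numbers))

human_input = [3, 1, 4]        # numbers the human actually typed
cat_input = [9, 9, 9, 9, 9]    # numbers produced by a cat walking on the keyboard

print(reward(human_input))  # 8.0
print(reward(cat_input))    # 45.0 -- scored higher, regardless of its source
```

For the agent’s common sense about where the keystrokes came from to matter, the objective has to point at the human (or the human’s judgement) rather than at the byte stream itself.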
The hope is to train an AI system that doesn’t work like that, in the same way that humans don’t work like that. (In fact, I could see this being the default way AI systems get trained; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)
Let me make sure I understand what you’re picturing as an example. Rather than giving an AI an explicit objective, we train it to follow instructions from a human (presumably using something RL-ish?), and the idea is that it will learn something like human common sense in order to better follow instructions. Is that a prototypical case of what you’re imagining? If so, what criteria do you imagine using for training? Maximizing a human approval score? Mimicking a human/predicting what a human would do and then doing that? Some kind of training procedure which somehow avoids optimizing anything at all?
Is that a prototypical case of what you’re imagining?
Yes.
Maximizing a human approval score?
Sure, that seems reasonable. Note that this does not mean that the agent ends up taking whichever actions maximize the number entered into a keyboard; it instead creates a policy that is consistent with the constraints “when asked to follow <instruction i>, I should choose action <most approved action i>”, for instructions and actions it is trained on. It’s plausible to me that the most “natural” policy that satisfies these constraints is one which predicts what a real human would think of the chosen action, and then chooses the action that does best according to that prediction.
(In practice you’d want to add other things like e.g. interpretability and adversarial training.)
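As a rough illustration of that setup, here is a toy sketch with a tabular policy, made-up instructions and actions, and a stand-in approval function in place of the real human (purely illustrative; not a claim about how such a system would actually be built):

```python
# Toy approval-based instruction following: learn, for each instruction,
# the action that the (stand-in) human rates most highly.
import random
from collections import defaultdict

INSTRUCTIONS = ["fetch water", "stack blocks"]
ACTIONS = ["fetch water", "stack blocks", "mash the keyboard"]

def human_approval(instruction: str, action: str) -> float:
    # Stand-in for the human's rating of how well the action follows the instruction.
    return 1.0 if action == instruction else 0.0

approval_estimate = defaultdict(float)  # tabular estimate for each (instruction, action)

for _ in range(2000):
    instruction = random.choice(INSTRUCTIONS)
    # Epsilon-greedy: mostly pick the currently best-rated action, sometimes explore.
    if random.random() < 0.1:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: approval_estimate[(instruction, a)])
    rating = human_approval(instruction, action)
    approval_estimate[(instruction, action)] += 0.1 * (rating - approval_estimate[(instruction, action)])

# The learned policy satisfies "when asked to follow <instruction i>,
# choose <most approved action i>" for the instructions it was trained on.
for instruction in INSTRUCTIONS:
    print(instruction, "->", max(ACTIONS, key=lambda a: approval_estimate[(instruction, a)]))
```

In the tabular toy the constraints pin the policy down exactly; with function approximation, many policies are consistent with the training data, which is where the question of which one is most “natural” comes in.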
It’s plausible to me that the most “natural” policy that satisfies these constraints is one which predicts what a real human would think of the chosen action...
I’d expect that’s going to depend pretty heavily on how we’re quantifying “most natural”, which brings us right back to the central issue.
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected—and that will produce basically the same problems as an actual human at a keyboard. The final policy won’t point to human values any more robustly than the data collection process did—if the data was generated by a human typing at a keyboard, then the most-predictive policy will predict what a human would type at a keyboard, not what a human “actually wants”. Garbage in, garbage out, etc.
More pithily: if a problem can’t be solved by a human typing something into a keyboard, then it also won’t be solved by simulating/predicting what the human would type into the keyboard.
It could be that there’s some viable criterion of “natural” other than just maximizing predictive power, but predictive power alone won’t circumvent the embeddedness problems.
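Here’s a toy rendering of the garbage-in-garbage-out point above, with invented strings standing in for the keyboard log (no real training setup intended): a predictor selected purely for accuracy on what was typed reproduces the log, cat walk included, and is no better a pointer to intent than the log itself.

```python
# Score two candidate predictors purely on how well they predict what was typed.
typed_log    = ["yes", "yes", "asdfjkl;", "yes", "no"]  # what the keyboard emitted
human_intent = ["yes", "yes", "yes",      "yes", "no"]  # never observed in training

def accuracy(predictions, targets):
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

log_predictor    = list(typed_log)     # "simulate the human (and cat) at the keyboard"
intent_predictor = list(human_intent)  # predict what the human actually wants

# Selection by predictive power over the log favours the log predictor...
print(accuracy(log_predictor, typed_log), accuracy(intent_predictor, typed_log))        # 1.0 0.8
# ...even though it is no better than the log itself at pointing to intent.
print(accuracy(log_predictor, human_intent), accuracy(intent_predictor, human_intent))  # 0.8 1.0
```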
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected—and that will produce basically the same problems as an actual human at a keyboard. [...] the most-predictive policy will predict what a human would type at a keyboard, not what a human “actually wants”.
Agreed. I don’t think we will get that policy, because it’s very complex. (It’s much easier / cheaper to predict what the human wants than to run a detailed simulation of the room.)
I’d expect that’s going to depend pretty heavily on how we’re quantifying “most natural”, which brings us right back to the central issue.
I’m making an empirical prediction, so I’m not the one quantifying “most natural”; reality is.
Tbc, I’m not saying that this is a good on-paper solution to AI safety; it doesn’t seem like we could know in advance that this would work. I’m saying that it may turn out that as we train more and more powerful systems, we see evidence that the picture I painted is basically right; in that world it could be enough to do some basic instruction-following.
I’m also not saying that this is robust to scaling up arbitrarily far; as you said, the literal most predictive policy doesn’t work.
Cool, I agree with all of that. Thanks for taking the time to talk through this.