Value learning. Building an AI that learns all of human value has historically been thought to be very hard, because it requires you to decompose human behavior into the “beliefs and planning” part and the “values” part, and there’s no clear way to do this.
My understanding is that IRL requires this, but it’s not obvious to me that supervised learning does? (It’s surprising to me how little attention supervised learning has received in AI alignment circles, given that it’s by far the most common way for us to teach current ML systems about our values.)
Anyway, regarding IRL: I can see how it would be harmful to make the mistake of attributing stuff to the planner which actually belongs in the values part.
For example, perhaps our AI observes a mother caring for her disabled child, and believes that the mother’s goal is to increase her inclusive fitness in an evolutionary sense, but that the mother is irrational and is following a suboptimal strategy for doing this. So the AI executes a “better” strategy for increasing inclusive fitness which allocates resources away from the child.
However, I haven’t seen a clear story for why the opposite mistake, of attributing stuff to the values part which actually belongs to the planner, would cause a catastrophe. It seems to me that in the limit, attributing all of human behavior to human values could end up looking something like an upload: it still makes the stupid mistakes that humans make, and it might not be competitive with other approaches, but it doesn’t seem to be unaligned in the sense that we normally use the term. You could make a speed superintelligence which basically values behaving as much like the humans it has observed as possible. But if this scenario is multipolar, each actor could be incentivized to spin the values/planner dial of its AI towards attributing more of human behavior to the human planner, in order to get an agent which behaves a little more rationally in exchange for a possibly lower-fidelity replication of human values.
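To make the values/planner dial concrete, here is a toy Bayesian version of the mother example; all of the probabilities are invented purely for illustration, and the only point is that the same observation supports very different conclusions about her values depending on what you assume about her planner:

```python
# Toy version of the mother example above. Two candidate "values" hypotheses
# and two candidate "planner" models; all numbers are invented for illustration.
#
#   R_child   : she intrinsically values caring for her child
#   R_fitness : she only values inclusive fitness
#   rational  : she reliably picks the action that best serves her values
#   noisy     : her planner often deviates from what best serves her values

# P(observed action = "care for child" | values, planner model)
likelihood = {
    ("R_child", "rational"): 0.99,    # caring for the child is optimal under R_child
    ("R_child", "noisy"):    0.80,
    ("R_fitness", "rational"): 0.01,  # (stipulated) suboptimal under R_fitness
    ("R_fitness", "noisy"):    0.45,  # a noisy planner might do it anyway
}

def posterior_over_values(planner_model, prior_child=0.5):
    """P(values | we saw her care for the child), for a fixed planner assumption."""
    p_child = prior_child * likelihood[("R_child", planner_model)]
    p_fitness = (1 - prior_child) * likelihood[("R_fitness", planner_model)]
    total = p_child + p_fitness
    return {"R_child": p_child / total, "R_fitness": p_fitness / total}

# Assume a rational planner: the observation is strong evidence she values the child.
print(posterior_over_values("rational"))  # ~{'R_child': 0.99, 'R_fitness': 0.01}
# Assume a noisy planner: the same observation is much weaker evidence,
# so more of her behavior gets explained away as planner error.
print(posterior_over_values("noisy"))     # ~{'R_child': 0.64, 'R_fitness': 0.36}
```

Spinning the dial towards the planner corresponds to using the noisy model (or some richer bias model): you buy the chance to outperform the human, but you also become easier to convince that the observed behavior isn’t what she really values.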
The long version has a clearer articulation of this point:
For an agent to outperform the process generating its data, it must understand the ways in which that process makes mistakes. So, to outperform humans at a task given only human demonstrations of that task, you need to detect human mistakes in the demonstrations.
So yes, you can achieve performance comparable to that of a human (with both IRL and supervised learning); the hard part is in outperforming the human.
Supervised learning could be used to learn a reward function that evaluates states as well as a human would; an agent trained on such a reward function could then outperform humans at actually creating good states (this would happen if humans are better at evaluating states than at creating them, which seems plausible).
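A minimal sketch of that idea, assuming synthetic data and a linear model as a stand-in for whatever supervised learner you would actually use: fit a reward model to human evaluations of states, then have the agent optimize the learned reward over states it can reach.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each state is a small feature vector and a human has scored how good it is.
states = rng.normal(size=(200, 5))                    # states the human evaluated
true_weights = np.array([1.0, 0.5, 0.0, -0.3, 0.2])   # stand-in for human judgment
human_scores = states @ true_weights + rng.normal(scale=0.1, size=200)  # noisy labels

# Supervised learning: least-squares fit of a linear reward model to the labels.
w, *_ = np.linalg.lstsq(states, human_scores, rcond=None)

def learned_reward(state):
    """Evaluates a state roughly as well as the human labels do."""
    return float(state @ w)

# The agent can now search over the states it is able to reach and pick the one
# the learned reward scores highest; this may be a state the human could evaluate
# as good but would not have been able to produce themselves.
reachable_states = rng.normal(size=(1000, 5))
best_state = max(reachable_states, key=learned_reward)
print("predicted value of best reachable state:", learned_reward(best_state))
```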
However, I haven’t seen a clear story for why the opposite mistake, of attributing stuff to the values part which actually belongs to the planner, would cause a catastrophe.
This is the default outcome of IRL; here IRL reduces to imitating a human. If you look at the posts that argue that value learning is hard, they all implicitly or explicitly agree with this point; they’re more concerned with how you get to superhuman performance (presumably because there will be competitive pressure to build superhuman AI systems). It is controversial whether imitating humans is safe (see the Human Models section).
You could make a speed superintelligence which basically values behaving as much like the humans it has observed as possible.
Yeah, the iterated amplification agenda depends on (among other things) a similar hope that it is sufficient to train an AI system that quickly approximates the result of a human thinking for a long time.
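Very roughly, the loop that hope relies on looks like the sketch below; every name in it is a hypothetical stand-in rather than anyone’s actual implementation.

```python
from typing import Callable, List

Question = str
Answer = str
Model = Callable[[Question], Answer]

def amplify(question: Question,
            decompose: Callable[[Question], List[Question]],      # human splits the question
            combine: Callable[[Question, List[Answer]], Answer],  # human merges sub-answers
            model: Model) -> Answer:
    """One step of 'a human thinking with model assistance': split the question,
    delegate the subquestions to the fast model, combine the results."""
    sub_answers = [model(q) for q in decompose(question)]
    return combine(question, sub_answers)

def distill(questions: List[Question],
            decompose, combine, model: Model,
            train: Callable[[List[Question], List[Answer]], Model]) -> Model:
    """Train a new fast model to imitate the slower amplified (human + model) system."""
    targets = [amplify(q, decompose, combine, model) for q in questions]
    return train(questions, targets)
```

Iterated amplification alternates these two steps, so the distilled model ends up quickly approximating what the human-plus-model system would have concluded given much more time.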
it’s not obvious to me that supervised learning does
What type of scheme do you have in mind that would allow an AI to learn our values through supervised learning?
Typically, the problem with supervised learning is that it’s too expensive to label everything we care about. In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval-directed agent? Or are you thinking of something more general or exotic?
Typically, the problem with supervised learning is that it’s too expensive to label everything we care about.
I don’t think we’ll create AGI without first acquiring capabilities that make supervised learning much more sample-efficient (e.g., better unsupervised methods would let us make better use of unlabeled data, so humans would no longer need to label everything they care about and could instead label just enough data to pinpoint “human values” as something observable in the world, or to characterize it as a cousin of things that are observable in the world).
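As a toy illustration of that hope, with PCA standing in for “better unsupervised methods” and a synthetic concept standing in for whatever we actually want labeled:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic world: high-dimensional observations generated from a few latent
# factors, with the concept we care about depending on one of those factors.
latent_dim, obs_dim = 5, 50
mixing = rng.normal(size=(latent_dim, obs_dim))

def sample(n):
    z = rng.normal(size=(n, latent_dim))
    x = z @ mixing + 0.05 * rng.normal(size=(n, obs_dim))
    y = (z[:, 0] > 0).astype(int)   # the concept a human would label
    return x, y

# Unsupervised stage: learn a compact representation from plentiful unlabeled data.
unlabeled_x, _ = sample(5000)
encoder = PCA(n_components=latent_dim).fit(unlabeled_x)

# Supervised stage: a human labels only a handful of examples.
few_x, few_y = sample(20)
head = LogisticRegression().fit(encoder.transform(few_x), few_y)

# The tiny labeled set is often enough here to pinpoint the concept reasonably well.
test_x, test_y = sample(1000)
print("accuracy with 20 labels:", head.score(encoder.transform(test_x), test_y))
```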
But if you think there are paths to AGI that don’t go through more sample-efficient supervised learning, one course of action would be to promote differential technological development towards more sample-efficient supervised learning and away from deep reinforcement learning. For example, we could try to convince DeepMind and OpenAI to reallocate resources away from deep RL and towards sample efficiency. (Note: I just stumbled on this recent paper, which is probably worth a careful read before considering advocacy of this type.)
In this case, are you imagining that we label some types of behaviors as good and some as bad, perhaps like what we would do with an approval-directed agent?
This seems like a promising option.
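As a purely illustrative sketch of that kind of scheme, with synthetic data and an invented approval rule: label (state, action) pairs as approved or not, fit a supervised model to the labels, and have the agent take whichever available action the model predicts would be most approved of.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def featurize(state, action):
    # Hypothetical feature map for a (state, action) pair; the elementwise
    # product lets a linear model capture "does this action fit this state?".
    return np.concatenate([state, action, state * action])

# Human-labeled dataset: approve actions that point in roughly the same
# direction as the state (a stand-in for "behaviors we endorse").
states = rng.normal(size=(500, 4))
actions = rng.normal(size=(500, 4))
approved = (np.sum(states * actions, axis=1) > 0).astype(int)
X = np.array([featurize(s, a) for s, a in zip(states, actions)])

approval_model = LogisticRegression().fit(X, approved)

def act(state, candidate_actions):
    """Approval-directed choice: take the action with the highest predicted approval."""
    feats = np.array([featurize(state, a) for a in candidate_actions])
    scores = approval_model.predict_proba(feats)[:, 1]
    return candidate_actions[int(np.argmax(scores))]

print(act(rng.normal(size=4), list(rng.normal(size=(10, 4)))))
```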