If you want to locate values within a model of humans, you can’t just train the model for predictive power, because human values only appear in a narrow zone of abstraction, more abstract than biology and less abstract than population statistics, and an AI scored only on prediction will be pressured to go to a lower level of abstraction.
I don’t understand why you’re so confident. It doesn’t seem to me that my values are divorced from biology (I want my body to stay healthy) or population statistics (I want a large population of people living happy lives). And if the AI knows the level of abstraction below my values, that sounds great, if it’s found a good way to express my values in terms of those lower-level abstractions. Indeed, this is exactly what I’d expect a functioning FAI to do!
If you’re worried that the AI will generalize poorly, that’s a reasonable worry! But, everyone in machine learning knows generalization is an extremely important problem, so maybe also say whether you think “better machine learning” will solve the problem, and if not, why not.
Once it starts encoding the world differently than we do, it won’t have the generalization properties we want—we’d be caught cheating, as it were.
Are you sure?
...brains, by contrast to the kinds of program we typically run on our computers, do not use standardized data storage and representation formats. Rather, each brain develops its own idiosyncratic representations of higher-level content. Which particular neuronal assemblies are recruited to represent a particular concept depends on the unique experiences of the brain in question (along with various genetic factors and stochastic physiological processes). Just as in artificial neural nets, meaning in biological neural networks is likely represented holistically in the structure and activity patterns of sizeable overlapping regions, not in discrete memory cells laid out in neat arrays.
From Superintelligence, p. 46.
I think your claim proves too much. Different human brains have different encodings, and yet we are still able to learn the values of other humans (for example, when visiting a foreign country) reasonably well when we make an honest effort.
BTW, I think it’s harmful to confidently dismiss possible approaches to Friendly AI based on shaky reasoning, especially if those approaches are simple. Simple approaches are more likely to be robust, and if the AI safety community has a bias towards assigning credence to pessimistic statements over optimistic ones (even when the pessimistic statements rest on shaky reasoning, or on reasoning which hasn’t been critically analyzed by others), that may cause us to neglect approaches which could actually be fruitful.
I don’t understand why you’re so confident. It doesn’t seem to me that my values are divorced from biology (I want my body to stay healthy) or population statistics (I want a large population of people living happy lives).
When I say your preference is “more abstract than biology,” I’m not saying you’re not allowed to care about your body, I’m saying something about what kind of language you’re speaking when you talk about the world. When you say you want to stay healthy, you use a fairly high-level abstraction (“healthy”), you don’t specify which cell organelles should be doing what, or even the general state of all your organs.
This choice of level of abstraction matters for generalization. At our current level of technology, an abstract “healthy” and an organ-level description might have the same outcomes, but at higher levels of technology, maybe someone who preferred to be healthy would be fine becoming a cyborg, while someone who wanted to preserve some lower-level description of their body would be against it.
“Once it starts encoding the world differently than we do, it won’t have the generalization properties we want—we’d be caught cheating, as it were.”
Are you sure?
I think the right post to link here is this one by Kaj Sotala. I’m not totally sure—there may be some way to “cheat” in practice—but my default view is definitely that if the AI carves the world up along different boundaries than we do, it won’t generalize in the same way we would, given the same patterns.
Nice find on the Bostrom quote btw.
I think your claim proves too much. Different human brains have different encodings, and yet we are still able to learn the values of other humans (for example, when visiting a foreign country) reasonably well when we make an honest effort.
I would bite this bullet, and say that when humans are doing generalization of values into novel situations (like trolley problems, or utopian visions), they can end up at very different places even if they agree on all of the everyday cases.
If you succeed at learning the values of a foreigner, so well that you can generalize those values to new domains, I’d suspect that the simplest way for you to do it involves learning about what concepts they’re using well enough to do the right steps in reasoning. If you just saw a snippet of their behavior and couldn’t talk to them about their values, you’d probably do a lot worse—and I think that’s the position many current value learning schemes place AI in.
Each of your three responses talks about generalization. As I mentioned, generalization is one of the central problems of machine learning. For example, in his critique of deep learning, Gary Marcus writes:
What we have seen in this paper is that challenges in generalizing beyond a space of training examples persist in current deep learning networks, nearly two decades later. Many of the problems reviewed in this paper — the data hungriness, the vulnerability to fooling, the problems in dealing with open-ended inference and transfer — can be seen as extension of this fundamental problem.
If Gary Marcus is right, then we’ll need algorithms which generalize better if we’re going to get to AGI anyway. So my question is: Let’s suppose better generalization is indeed a prerequisite for AGI, which means we can count on algorithms which generalize well being available at the time we are trying to construct our FAI. What other problems might we encounter? Is FAI mainly a matter of making absolutely sure that the algorithms do in fact generalize, or are there still other worries?
BTW, note that perfect generalization is not actually needed. It’s sufficient for the system to know when its models might not apply in a particular situation (due to “distributional shift”) and ask for clarification at that point in time. See also this previous thread where I claimed that “I haven’t yet seen an FAI problem which seems like it can’t somehow be reduced to calibrated learning.” (Probably hyperbolic.)
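To make that concrete, here’s a minimal sketch in Python of the kind of loop I have in mind. Everything in it is a stand-in (the GaussianNB model, the 0.9 threshold, the ask_human callback), and real calibration is harder than thresholding predicted probabilities, but this is the shape of it:

```python
# Toy sketch of "ask for clarification when the model might not apply".
# The model, the 0.9 threshold, and ask_human are hypothetical stand-ins.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def make_deferring_classifier(X_train, y_train, threshold=0.9):
    model = GaussianNB()
    model.fit(X_train, y_train)

    def predict_or_ask(x, ask_human):
        probs = model.predict_proba(np.asarray(x).reshape(1, -1))[0]
        if probs.max() < threshold:
            # Low confidence: treat this as possible distributional shift
            # and defer to the human instead of guessing.
            return ask_human(x)
        return model.classes_[probs.argmax()]

    return predict_or_ask
```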
I don’t think whether labels are provided by humans fundamentally changes the nature of the generalization problem. Humans providing labels is very typical in mainstream ML work.
I’m not excited about FAI work which creates a lot of additional complexity in order to get around the fact that current algorithms don’t generalize well, if algorithms are going to have to generalize better in order to get us to AGI anyway.
Yes, I agree that generalization is important. But I think it’s a bit too reductive to think of generalization ability as purely a function of the algorithm.
For example, an image-recognition algorithm trained with dropout generalizes better, because dropout acts like an extra goal telling the training process to search for category boundaries that are smooth in a certain sense. And the reason we expect that to work is because we know that the category boundaries we’re looking for are in fact usually smooth in that sense.
So it’s not like dropout is a magic algorithm that violates a no-free-lunch theorem and extracts generalization power from nowhere. The power that it has comes from our knowledge about the world that we have encoded into it.
(And there is a no free lunch theorem here. How to generalize beyond the training data is not uniquely encoded in the training data, every bit of information in the generalization process has to come from your model and training procedure.)
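To make the dropout point concrete, here’s a minimal PyTorch-style sketch (the architecture, layer sizes, and dropout rate are arbitrary choices of mine). The only difference between the two models is the Dropout layer, i.e. one extra bit of encoded knowledge that good category boundaries should be robust to randomly deleted features:

```python
# Minimal sketch: the only difference between the two models is dropout.
# Architecture, layer sizes, and the 0.5 rate are arbitrary choices.
import torch.nn as nn

def make_classifier(n_in, n_hidden, n_out, use_dropout=False):
    layers = [nn.Linear(n_in, n_hidden), nn.ReLU()]
    if use_dropout:
        layers.append(nn.Dropout(p=0.5))  # the "extra goal": smoother boundaries
    layers.append(nn.Linear(n_hidden, n_out))
    return nn.Sequential(*layers)

plain       = make_classifier(784, 256, 10)
regularized = make_classifier(784, 256, 10, use_dropout=True)
```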
For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization (“human values”), and single out part of that generalization to make plans with. The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.
The power that it has comes from our knowledge about the world that we have encoded into it.
This knowledge could also come from other sources, e.g. transfer learning.
We know that human children are capable of generalizing from far fewer examples than current ML algorithms require. That suggests human brains are fundamentally better at learning in some sense. I think we’ll be able to replicate this capability before we get to AGI.
For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It has to not only predict human actions, it has to make a very particular sort of generalization (“human values”), and single out part of that generalization to make plans with.
As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is “learning and respecting human preferences”, object recognition is “learning human preferences about how to categorize images”, and sentiment analysis is “learning human preferences about how to categorize sentences”.
I’ve never heard anyone in machine learning divide the field into cases where we’re trying to generalize about human values and cases where we aren’t. It seems like the same set of algorithms, tricks, etc. work either way.
BTW, Claude Shannon once wrote:

Suppose that you are given a problem to solve, I don’t care what kind of a problem — a machine to design, or a physical theory to develop, or a mathematical theorem to prove, or something of that kind — probably a very powerful approach to this is to attempt to eliminate everything from the problem except the essentials; that is, cut it down to size. Almost every problem that you come across is befuddled with all kinds of extraneous data of one sort or another; and if you can bring this problem down into the main issues, you can see more clearly what you’re trying to do and perhaps find a solution. Now, in so doing, you may have stripped away the problem that you’re after. You may have simplified it to a point that it doesn’t even resemble the problem that you started with; but very often if you can solve this simple problem, you can add refinements to the solution of this until you get back to the solution of the one you started with.
In other words, I think trying to find the “essential core” of a problem is a good problem-solving strategy, including for a problem like friendliness. I have yet to see a non-handwavey argument against the idea that generalization is the “essential core” of friendliness.
The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.
I actually think the work humans do can be straightforward and easy. Something like: Have the system find every possible generalization which seems reasonable, then synthesize examples those generalizations disagree on. Keep asking the humans about those synthesized examples until you’ve narrowed down the number of possible generalizations the human plausibly wants, to the point where you can be reasonably confident about the human’s desired behavior in a particular circumstance.
I think this sort of approach is typically referred to as “active learning” or “machine teaching” by ML practitioners. But it’s not too different from the procedure that you would use to learn about someone’s values if you were visiting a foreign country.
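Here’s roughly that loop as a sketch, in query-by-committee style. The committee of candidate models, the pool of candidate examples, and ask_human are all placeholders; a real system would synthesize the disagreement cases rather than select them from a fixed pool:

```python
# Sketch of "find reasonable generalizations, ask the human about their
# disagreements, narrow down". Committee, pool, and ask_human are placeholders.
import numpy as np

def narrow_down(committee, X_init, y_init, X_pool, ask_human, n_queries=20):
    """committee: candidate models already fit to (X_init, y_init), each one
    representing a different 'reasonable generalization' of the initial labels."""
    X_lab, y_lab = list(X_init), list(y_init)
    for _ in range(n_queries):
        votes = np.stack([m.predict(X_pool) for m in committee])  # each model votes
        # Ask about the example the candidate generalizations disagree on most.
        disagreement = np.array([len(set(votes[:, j])) for j in range(len(X_pool))])
        j = int(disagreement.argmax())
        X_lab.append(X_pool[j])
        y_lab.append(ask_human(X_pool[j]))
        X_pool = np.delete(X_pool, j, axis=0)
        for m in committee:  # refit every candidate on the human's new answer
            m.fit(np.array(X_lab), np.array(y_lab))
    return committee
```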
Ah, but I don’t trust humans to be a reliable source when it comes to what an AI should do with the future lightcone. I expect you’d run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.
As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is “learning and respecting human preferences”, object recognition is “learning human preferences about how to categorize images”, and sentiment analysis is “learning human preferences about how to categorize sentences”.
I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc. I don’t think that’s enough. If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.
So this is two separate problems: one, I think humans can’t reliably tell an AI what they value through a text channel, even with prompting, and two, I think that mimicking human behavior, even human behavior on moral questions, is insufficient to deal with the possibilities of the future.
I’ve never heard anyone in machine learning divide the field into cases where we’re trying to generalize about human values and cases where we aren’t. It seems like the same set of algorithms, tricks, etc. work either way.
It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.
Let me take another go at the distinction: Suppose you have a big training set of human answers to moral questions. There are several different things you could mean by “generalize well” in this case, which correspond to solving different problems.
The first kind of “generalize well” is where the task is to predict moral answers drawn from the same distribution as the training set. This is what most of the field is doing right now for Ian Goodfellow’s examples of categorizing images or categorizing sentences. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing the test set.
Another sort of “generalize well” might be inferring a larger “real world” distribution even when the training set is limited. For example, if you’re given labeled data mapping handwritten digits 0-20 to binary outputs, can you give the correct binary output for 21? How about 33? In our moral questions example, this would be like predicting answers to moral questions spawned by novel situations not seen in training. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing examples later drawn from the real world.
Let’s stop here for a moment and point out that if we want generalization in the second sense, algorithmic advances in the first sense might be useful, but they aren’t sufficient. For the classifier to output the binary for 33, it probably has to be deliberately designed to learn flexible representations, and probably get fed some additional information (e.g. by transfer learning). When the training distribution and the “real world” distribution are different, you’re solving a different problem than when they’re the same.
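A toy illustration of that gap, with numeric regression standing in for the digit example (the function, the model, and the ranges are arbitrary; the point is the shape of the failure, not any particular architecture):

```python
# Toy illustration: a model can do well on held-out points from the training
# range (type-1 generalization) while being badly wrong outside it (type-2).
# The sine function, the MLP, and the ranges are arbitrary illustration choices.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 20, size=(500, 1))
y = np.sin(X[:, 0])  # stand-in for "the real-world rule"

model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000).fit(X, y)

X_in  = rng.uniform(0, 20, size=(100, 1))  # same distribution as training
X_out = np.array([[33.0]])                 # off-distribution, like asking about 33
print("in-distribution error:", np.abs(model.predict(X_in) - np.sin(X_in[:, 0])).mean())
print("prediction at 33:", model.predict(X_out)[0], "truth:", np.sin(33.0))
```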
A third sort of “generalize well” is to learn superhumanly skilled answers even if the training data is flawed or limited. Think of an agent that learns to play Atari games at a superhuman level, from human demonstrations. This generalization task often involves filling in a complex model of the human “expert,” along with learning about the environment—for current examples, the model of the human is usually hand-written. The better we get at generalizing in this way, the more the AI’s answers will be like “what we meant” (either by some metric we kept hidden from the AI, or in some vague intuitive sense) even if they diverge from what humans would answer.
(I’m sure there are more tasks that fall under the umbrella of “generalization,” but you’ll have to suggest them yourself :) )
So while I’d say that value learning involves generalization, I think that generalization can mean a lot of different tasks—a rising tide of type 1 generalization (which is the mathematically simple kind) won’t lift all boats.
Ah, but I don’t trust humans to be a reliable source when it comes to what an AI should do with the future lightcone.
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
I expect you’d run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.
Scott is describing distributional shift in that essay. Here’s a quote:
The further we go toward the tails, the more extreme the divergences become. Utilitarianism agrees that we should give to charity and shouldn’t steal from the poor, because Utility, but take it far enough to the tails and we should tile the universe with rats on heroin. Religious morality agrees that we should give to charity and shouldn’t steal from the poor, because God, but take it far enough to the tails and we should spend all our time in giant cubes made of semiprecious stones singing songs of praise. Deontology agrees that we should give to charity and shouldn’t steal from the poor, because Rules, but take it far enough to the tails and we all have to be libertarians.
The “distribution” is the set of moral questions that we find ourselves pondering in our everyday lives. Each moral theory (Utilitarianism, religious morality, etc.) is an attempt to make sense of our moral intuitions in a variety of different situations and “fit a curve” through them somehow. The trouble comes when we start considering unusual “off-distribution” moral situations and asking what our moral intuitions say in those situations.
So this isn’t actually a different problem. As Shannon said, once you pare away the extraneous data, you get a simplified problem which represents the core of what needs to be accomplished.
humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.
Yep. I address this in this comment; search for “The problem is that the overseer has insufficient time to reflect on their true values.”
I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc.
Sure, so we just have to learn human behavior at categorizing our AGI’s behavior as desired or undesired. Approval-direction, essentially.
If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.
Eliezer Yudkowsky wrote:

If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.
Sure. So my point is, so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.) So this all suggests that it isn’t actually a different problem, fundamentally speaking.
By the way, everything I’ve been saying is about supervised learning, not RL.
I agree with the rest of your comment. I’m focused on the second kind of generalization. As you say, work on the first kind may or may not be useful. I think you can get from the second kind (correctly replicating human labels) to the third kind (“superhuman” labels that the overseer wishes they had thought of themselves) based on active learning, as I described earlier.
“I don’t trust humans to be a reliable source when it comes to what an AI should do with the future lightcone.”
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
Sure :) I’ve said similar things elsewhere, but I suppose one must sometimes talk to people who haven’t read one’s every word :P
We’re being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn’t just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.
There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.
Lastly, figuring out what the AI should do with its resources is really hard, and figuring out which of two complicated choices to call “better” can be hard too, and humans will sometimes do badly at it. Worst case, the humans answer hard questions with apparent certainty, or the process of asking about the AI’s most uncertain cases slowly devolves into handing humans hard questions and treating their answers as strong evidence.
I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by “take this into account,” I’m pretty sure that means model the human and treat preferences as objects in the model.
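Here’s the kind of thing I mean by “treat preferences as objects in the model”, as a sketch: a Boltzmann-rational (noisily rational) model of the human, where the AI infers a latent preference vector from the human’s pairwise choices instead of taking each answer at face value. The features, the rationality parameter beta, and the grid of candidate preference hypotheses are all made up for illustration:

```python
# Sketch: infer a latent preference vector theta from a human's pairwise
# choices under a Boltzmann-rational choice model, rather than trusting the
# answers literally. Features, beta, and candidate thetas are illustrative.
import numpy as np

def choice_loglik(theta, choices, beta=2.0):
    """choices: list of (features_of_chosen_option, features_of_rejected_option)."""
    total = 0.0
    for f_chosen, f_rejected in choices:
        u_c, u_r = theta @ f_chosen, theta @ f_rejected
        # P(human picks the chosen option); lower beta = noisier, less trusted answers.
        total += beta * u_c - np.logaddexp(beta * u_c, beta * u_r)
    return total

def infer_preferences(choices, candidate_thetas, beta=2.0):
    logliks = np.array([choice_loglik(t, choices, beta) for t in candidate_thetas])
    posterior = np.exp(logliks - logliks.max())
    return posterior / posterior.sum()  # a distribution over preference hypotheses
```

The point isn’t this particular model; it’s that the human’s statements get treated as evidence about preferences, weighted by how reliable we expect the human to be in that regime, rather than as commands.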
Skipping over the intervening stuff I agree with, here’s that Eliezer quote:
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.
Though I’m not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn’t really matter if it’s passing the buck or not.
But my original thought wasn’t about uploads (though that’s definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.
Though maybe you went in the right direction anyhow, and if all you had was supervised learning the right thing to do is to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book—Zendegi?
so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.)
There are some cases where the AI specifically has a model of the human, and I’d call those “special methods.” Not just IRL, the entire problem of imitation learning often uses specific methods to model humans, like “value iteration networks.” This is the sort of development I’m thinking of that helps AI do a better job at generalizing human values—I’m not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.
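For concreteness, the planning computation that VIN-style approaches wrap a learned, differentiable module around is just value iteration; here’s the plain tabular version (not the convolutional one from the paper):

```python
# Minimal tabular value iteration, i.e. the planning computation that value
# iteration networks embed as a differentiable module (plain version only).
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: transition probabilities with shape (A, S, S); R: rewards, shape (S, A)."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * np.einsum("ast,t->sa", P, V)  # Q[s, a]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)  # state values and greedy policy
        V = V_new
```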