I’m much more optimistic about extracting human values from human behavior, for reasons that I’ve only really developed over the last year, and I think it might be worthwhile addressing what they say.
Rohin and Buck mention the model of humans as utility function + decision-making process. This model is underdetermined by the real-world data—you could fit my behavior to a really weird utility function over world-histories plus a flawless planner, or to an even weirder utility function plus an exactly anti-rational planner that always makes the wrong choice. All of that is right. But I think they’re mistaken when they associate doing better than this with deriving ought from is.
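To make the underdetermination concrete, here is a toy sketch (the options, numbers, and function names are all invented for illustration, not anything from the podcast): the same observed choice is fit exactly by one reward function with a rational planner and by the negated reward function with an anti-rational planner, so the behavioral data alone cannot separate the two hypotheses.

```python
# Toy illustration of the (utility function + planner) underdetermination.
# Everything here is hypothetical; the point is only that the same behavior
# is consistent with opposite "values" once the planner is allowed to vary.

options = {"apple": 3.0, "cake": 5.0, "rock": -1.0}  # one candidate reward function

def rational_planner(reward):
    """Pick the option with the highest reward."""
    return max(reward, key=reward.get)

def anti_rational_planner(reward):
    """Pick the option with the lowest reward (always chooses 'wrongly')."""
    return min(reward, key=reward.get)

observed_choice = "cake"  # what we actually see the human do

# Hypothesis A: the human values cake and plans flawlessly.
reward_a = options
# Hypothesis B: the human anti-values cake and plans exactly anti-rationally.
reward_b = {option: -value for option, value in options.items()}

assert rational_planner(reward_a) == observed_choice
assert anti_rational_planner(reward_b) == observed_choice
# Both (reward, planner) pairs predict the observed behavior perfectly,
# so behavior alone cannot tell "values cake" apart from "anti-values cake".
```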
The reason we can do better is because we’re not trying to find “the true utility function that humans actually have.” (There is no such object.) Instead, the entire notion that humans have something like a utility function can come from how useful it is to think of humans as having goals. It lets us make good coarse-grained predictions of human behavior, given a parsimonious model and little extra information.
I’m very optimistic about being able to use this sort of constraint to do as well as humans at inferring what humans want. But this isn’t all that impressive, because when we think about what other people want, we think about it in a very local way—we don’t try to construct some giant utility function in our heads.
I’m slightly pessimistic about taking this human-scale inference and inferring the entries in a giant utility function over universe-histories.
I’m slightly optimistic about being able to do moral reasoning and planning in a way that doesn’t involve looking up the utility function over universe-histories, but that we could still recognize and acknowledge as being skillfully moral, and probably not going to screw things up.
In this sense, there’s actually a convergence between my view of value learning and the iterated amplification / human mimicry school, because both are trying to do superhuman moral reasoning using parts that can’t use vanilla inference about the environment to scale much past human level. But I think the big distinction is that when I think about value learning I’m always thinking about how to directly query the thing playing the role of values, whereas if you are doing human mimicry research, your human-scale parts are communicating with each other via natural language and other normal human interfaces.
I’m slightly optimistic about being able to do moral reasoning and planning in a way that doesn’t involve looking up the utility function over universe-histories, but that we could still recognize and acknowledge as being skillfully moral, and probably not going to screw things up.
I think this describes my position as well (noting especially that it is only “slightly optimistic”), but...
In this sense, there’s actually a convergence between my view of value learning and the iterated amplification / human mimicry school, because both are trying to do superhuman moral reasoning using parts that can’t use vanilla inference about the environment to scale much past human level.
This is the part where I’m confused. Where does this extra information come from? How do you get superhuman moral reasoning? Some possible answers:
Humans having particular simple goals is a simple explanation of the world and so will be learned by the AI system; these goals will correctly generalize to new situations better than humans themselves would, because *handwave* the true goals are simple and the false goals that wouldn’t generalize aren’t simple, so they won’t be learned; and the AI is better at optimizing for those goals than humans are.
Same thing, except instead of goals being simple, it is philosophical reasoning that is simple.
But I don’t think your comments generally indicate these as the reasons you’re talking about, and I don’t really understand what your proposed reason is.
Confidence level of the following: Lower than what’s represented, but still one of the more profitable ways I’m trying to think about the problem. Not trolling.
Right. So, my definition of superhuman moral reasoning is maybe a bit unusual. But I think it’s because common sense is wrong here.
The common sense view is that humans have some set of true values hidden away somewhere, and that superhuman moral reasoning means doing a better job than humans at adhering to the true values. It just seems super obvious to our common sense that there is some fact of the matter about which patterns in human behavior are the true human values, and which are quirks of the decision-making process.
In short, we intuitively use the agent model of humans in our model of the world, so it feels like ground truth. But we’re trying to solve a problem here where that approximation isn’t very useful.
Instead of looking for “the true values,” I think we should judge superhuman moral reasoning by the standards that humans would use to try to detect superhuman moral reasoning. Humans don’t have a magical true-values detector, but we still have this notion of superhuman reasoning, and we have ideas about what it might look like—carefully deliberative, cosmopolitan, pro-social rather than anti-social, reflecting experience with the world, leading to pretty good outcomes even by the standards of ordinary humans, unaffected by certain bits of human decision-making we think of as bias rather than preference, that sort of thing.
As Wei Dai would put it, I don’t think that there’s such a thing as superhuman moral reasoning because humans have utility functions, I think there’s such a thing as superhuman moral reasoning because humans are philosophers.
If you have a bunch of efficient human imitations in a factored question answering setup, you get superhuman reasoning by having them try really hard to figure out what the superhumanly-moral thing to do is. And this is real, bona-fide superhuman moral reasoning! It doesn’t matter that it doesn’t have planning-level access to the (nonexistent) true human values, it just matters that it has the sort of properties we gesture at with labels like “superhuman moral reasoning.”
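For concreteness, here is a minimal sketch of what such a factored question-answering setup could look like (an HCH-style tree; `human_imitation`, `decompose`, and the slash-delimited question format are placeholders invented here, not anyone’s actual system):

```python
# Minimal sketch of factored question answering with human imitations.
# Each call does only one human-sized reasoning step; any "superhuman"
# quality has to come from the structure of the whole tree.

def human_imitation(question, sub_answers):
    """Placeholder for a learned model of a human answering one small question."""
    return f"answer to {question!r} given {sub_answers}"

def decompose(question):
    """Placeholder decomposition: treat slashes as sub-question boundaries."""
    parts = [part.strip() for part in question.split("/")]
    return parts if len(parts) > 1 else []

def factored_qa(question, depth=3):
    """Answer a question by recursively answering its sub-questions."""
    if depth == 0:
        return human_imitation(question, [])
    sub_answers = [factored_qa(sub, depth - 1) for sub in decompose(question)]
    return human_imitation(question, sub_answers)

print(factored_qa("what is the superhumanly-moral thing to do here "
                  "/ what do the affected people care about "
                  "/ which of my reactions are bias rather than preference"))
```

No single node reasons past human level; the claim is that the tree as a whole can still have the properties we gesture at with “superhuman moral reasoning.”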
In the value learning worldview, this means that improving moral reasoning and doing meta-preference reasoning[1] are inextricably linked. Going from human-level moral reasoning to superhuman moral reasoning (if it’s useful to imagine going from one to the other) looks like a process of taking meta-preferential steps, not like hooking up a better planner to a fixed utility function.
How to do this is, of course, a completely unsolved problem.
[1] By meta-preferences, I mean our opinions about our own decision-making process. What are important preferences and what’s bias, what are better reasons and what are worse ones, when does it feel like being smarter would improve your decisions, etc. Not sure if there’s better terminology for this already.
I think I just completely agree with this and am confused what either Buck or I said that was in opposition to this view.
As a particular example, I don’t think what you said is in opposition to “we must make assumptions to bridge the is-ought gap”.
Like, you also can’t deduce that the sun is going to rise tomorrow—for that, you need to assume that the future will be like the past, or that the world is “simple”, or something like that (see the Problem of Induction). Nonetheless, I feel fine with having some assumption of that form, and baking such an assumption into an AI system. Similarly, just because I say “there needs to be some assumption that bridges the is-ought gap” doesn’t mean that everything is hopeless—there can be an assumption that we find very reasonable that we’re happy to bake into the AI system.
Here’s a stab at the assumption you’re making:
Suppose that “human standards” are a function f that takes in a reasoner r and quantifies how good r is at moral reasoning. (This is an “is” fact about the world.) Then, if we select r* = argmax_r f(r), the things that r* claims are “ought” facts actually are “ought” facts.
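A minimal sketch of what baking that assumption into a system could look like (the scoring function and the candidate reasoners are invented for illustration; this isn’t anyone’s actual proposal):

```python
# Sketch of the bridging assumption: score reasoners by "human standards" f,
# take r* = argmax_r f(r), and then trust r*'s "ought" claims.
# The scores and candidates below are made up.

def human_standards_score(reasoner):
    """f(r): an 'is' fact about how good humans judge r's moral reasoning to be."""
    return reasoner["deliberation"] + reasoner["cosmopolitanism"] - reasoner["bias"]

candidates = [
    {"name": "r1", "deliberation": 0.9, "cosmopolitanism": 0.6, "bias": 0.3},
    {"name": "r2", "deliberation": 0.7, "cosmopolitanism": 0.9, "bias": 0.1},
]

r_star = max(candidates, key=human_standards_score)  # r* = argmax_r f(r)

# The 'ought' step is the assumption itself, not something derived from the scores:
# whatever r_star claims ought to be done is taken to actually be what ought to be done.
print(f"Trusting the 'ought' claims of {r_star['name']}")
```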
This is similar to, though not the same as, what Buck said:
your system will do an amazing job of answering “is” questions about what humans would say about “ought” questions. And so I guess maybe you could phrase the second part as: to get your system to do things that match human preferences, use the fact that it knows how to make accurate “is” statements about humans’ ought statements?
And my followup a bit later:
As Buck said, that lets you predict what humans would say about ought statements, and your assumption could then be: whatever humans say about ought statements, that’s what you ought to do. And that’s still an assumption. Maybe it’s a very reasonable assumption that we’re happy to put into our AI system.
Also, re
But I think it’s because common sense is wrong here.
The common sense view is that humans have some set of true values hidden away somewhere, and that superhuman moral reasoning means doing a better job than humans at adhering to the true values. It just seems super obvious to our common sense that there is some fact of the matter about which patterns in human behavior are the true human values, and which are quirks of the decision-making process.
This seems like the same thing Buck is saying here:
Buck Shlegeris: I think I want to object to a little bit of your framing there. My stance on utility functions of humans isn’t that there are a bunch of complicated subtleties on top, it’s that modeling humans with utility functions is just a really sad state to be in. If your alignment strategy involves positing that humans behave as expected utility maximizers, I am very pessimistic about it working in the short term, and I just think that we should be trying to completely avoid anything which does that. It’s not like there’s a bunch of complicated sub-problems that we need to work out about how to describe us as expected utility maximizers, my best guess is that we would just not end up doing that because it’s not a good idea.
Tbc, I agree with Buck’s framing here, though maybe not as confidently—it seems plausible though unlikely (~10%?) to me that approximating humans as EU maximizers would turn out okay, even though it isn’t literally true.
Yeah, on going back and reading the transcript, I think I was just misinterpreting what you were talking about. I agree with what you were trying to illustrate to Lucas.
(Though I still don’t think the is-ought distinction is a perfect analogy. One way this manifests: I also don’t think the notion of “ought facts” would make it into a later draft of your assumption example—though on the other hand, talking about ought facts makes more sense when you’re trying to find a utility-function-type object, which [shock] is what you were talking about at that point in the podcast.)