The point isn’t that there is nothing wrong or dangerous about learning biases and rewards. The point is that the OP is not very relevant to those concerns. The OP says that learning can’t be done without extra assumptions, but we have plenty of natural assumptions to choose from. The fact that assumptions are needed is interesting, but it is by no means a strong argument against IRL.
What if, in reality, due to effects currently beyond our understanding, our actions are making the future more likely to be dystopian in some way than if we took random actions?
That’s an interesting question, because we obviously are taking actions that make the future more likely to be dystopian—we’re trying to develop AGI, which might turn out unfriendly.
we have plenty of natural assumptions to choose from.
You’d think so, but nobody has defined these assumptions in anything like sufficient detail to make IRL work. My whole research agenda is essentially a way of defining these assumptions, and it seems to be a long and complicated process.
Evaluating R on a single example of human behavior is good enough to reject R(2), R(4) and possibly R(3).
Example: this morning I went to the kitchen and picked up a knife. Among possible further actions, I had A—“make a sandwich” and B—“stab myself in the gut”. I chose A. R(2) and R(4) say I wanted B and R(3) is indifferent. I think that’s enough reason to discard them.
Why not do this? Do you not agree that this test discards dangerous R more often than useful R? My guess is that you’re asking for very strong formal guarantees from the assumptions that you consider, and using a narrow interpretation of what it means to “make IRL work”.
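To make the test concrete, here is a toy sketch (my own formalization, with stand-in candidates rather than the OP’s exact R(1)–R(4)): keep only the candidates that strictly prefer the observed action to the rejected one.

```python
# Toy version of the single-example test: the observed choice was A ("make a
# sandwich") over B ("stab myself in the gut"). A candidate reward survives
# only if it strictly prefers the chosen action to the rejected one.

CHOSEN, REJECTED = "make sandwich", "stab self"

# Stand-ins for the candidates discussed above (not the OP's exact objects):
# an R(0)-like genuine reward, an R(2)/R(4)-like reversed reward, an
# R(3)-like indifferent reward.
candidates = {
    "R_genuine":     {"make sandwich": 1.0,  "stab self": -10.0},
    "R_reversed":    {"make sandwich": -1.0, "stab self": 10.0},
    "R_indifferent": {"make sandwich": 0.0,  "stab self": 0.0},
}

def consistent_with_observation(reward, chosen=CHOSEN, rejected=REJECTED):
    """Keep a candidate only if the chosen action has strictly higher reward."""
    return reward[chosen] > reward[rejected]

surviving = {name for name, r in candidates.items()
             if consistent_with_observation(r)}
print(surviving)  # only "R_genuine" survives this one observation
```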
Rejecting any specific R is easy—one bit of information (at most) per specific R. So saying “humans have preferences, and they are not always rational or always anti-rational” rules out R(1), R(2), and R(3). Saying “this apparent preference is genuine” rules out R(4).
But it’s not like there are just these five preferences and once we have four of them out of the way, we’re done. There are many, many different preferences in the space of preferences, and many, many of them will be simpler than R(0). So to converge to R(0), we need to add huge amounts of information, ruling out more and more candidates.
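A back-of-the-envelope version of this counting point, with toy numbers I’m inventing purely for illustration (a 10-state, 4-action world and rewards valued in {-1, 0, 1}):

```python
import math

# Toy numbers (assumptions for illustration only): a small environment with
# 10 states and 4 actions, and candidate rewards taking values in {-1, 0, 1}.
n_states, n_actions, n_values = 10, 4, 3

# Number of distinct reward functions over state-action pairs, and the bits
# needed to single out one of them.
n_rewards = n_values ** (n_states * n_actions)
bits_to_single_out_r0 = math.log2(n_rewards)

# Rejecting one specific candidate, or answering one yes/no question about
# the human, supplies at most one bit.
bits_from_ruling_out_four_candidates = 4

print(f"{bits_to_single_out_r0:.0f} bits needed vs "
      f"{bits_from_ruling_out_four_candidates} bits gained")
# ~63 bits needed vs 4 bits gained, even in this tiny toy setting.
```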
Basically, we need to include enough information to define R(0), which is what my research project is trying to do. What you’re seeing as “adding enough clear examples” is actually “hand-crafting R(0) in totality”.

For more details, see here: https://arxiv.org/abs/1712.05812
But it’s not like there are just these five preferences and once we have four of them out of the way, we’re done.
My example test is not nearly as specific as you imply. It discards large swaths of harmful and useless reward functions. Additional test cases would restrict the space further. There are still harmful Rs in the remaining space, but their proportion must be much lower than in the beginning. Is that not good enough?
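Here is a rough way to check the “proportion goes down” claim on a toy model. Everything below is made up for illustration: random candidate rewards over four actions, three observed choices used as filters, and “harmful” defined as assigning positive reward to the self-harm action.

```python
import random

random.seed(0)

actions = ["make sandwich", "stab self", "do nothing", "go to work"]

# Behavioral test cases: (chosen, rejected) pairs we claim to have observed.
test_cases = [("make sandwich", "stab self"),
              ("do nothing", "stab self"),
              ("go to work", "stab self")]

def random_reward():
    return {a: random.uniform(-1, 1) for a in actions}

def passes(reward, cases):
    return all(reward[chosen] > reward[rejected] for chosen, rejected in cases)

def harmful(reward):
    # Toy criterion: positive reward for the self-harm action, which an
    # optimizer could end up pursuing in situations we never observed.
    return reward["stab self"] > 0

candidates = [random_reward() for _ in range(100_000)]
survivors = [r for r in candidates if passes(r, test_cases)]

frac_before = sum(map(harmful, candidates)) / len(candidates)
frac_after = sum(map(harmful, survivors)) / len(survivors)
print(f"harmful before filtering: {frac_before:.1%}")  # ~50%
print(f"harmful after filtering:  {frac_after:.1%}")   # ~6%, lower but not zero
```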
What you’re seeing as “adding enough clear examples” is actually “hand-crafting R(0) in totality”.
Are you saying that R can’t generalize if trained on a reasonably sized data set? This is very significant, if true, but I don’t see it.
Details are good. I have a few notes though.

This might be a nitpick, but there is no such thing as a “true” decomposition. If the agent was not originally composed from p and R, then none of the decompositions are “true”. There are only “useful” decompositions. But that itself requires many assumptions about how usefulness is measured. I’m confused about how much of a problem this is. But it might be a big part of our philosophical difference—I want to slap together some ad hoc stuff that possibly works, while you want to find something true.
The high complexity of the genuine human reward function
In this section you show that the pair (p(0), R(0)) has high complexity, but it seems that p(0) could be complex while R(0) is relatively simple, contrary to what the title suggests. We don’t actually need to find p(0); finding R(0) should be good enough.
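One way to put this, using Kolmogorov complexity K as a stand-in for whatever complexity measure the section uses (my notation, not the OP’s):

$$K(R_0) \le K(p_0, R_0) + O(1), \qquad K(p_0, R_0) \le K(p_0) + K(R_0) + O(\log),$$

so a complex pair is perfectly compatible with a simple R(0), as long as p(0) carries the complexity.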
Our hope is that with some minimal assumptions about planner and reward we can infer the rest with enough data.
Huh, isn’t that what I’m saying? Is the problem that the assumptions I mentioned are derived from observing the human?
Slight tangent: I realized that the major difference between a human and the agent H (from the first example in the OP) is that the human can take complex inputs. In particular, it can take logical propositions about itself or about a desirable R(0), and approve or disapprove of them. I’m not saying that “find an R(0) that a human would approve of” is a good algorithm, but something along those lines could be useful.
We may not be disagreeing any more. Just to check, do you agree with both these statements:

1. Adding a few obvious constraints rules out many different R, including the ones in the OP.
2. Adding a few obvious constraints is not enough to get a safe or reasonable R.

1 is trivial, so yes. But I don’t agree with 2. Maybe the disagreement comes from “few” and “obvious”? To be clear, I count evaluating some simple statistic on a large data set as one constraint. I’m not so sure about “obvious”. It’s not yet clear to me that my simple constraints aren’t good enough. But if you say that more complex constraints would give us a lot more confidence, that’s reasonable.
From the OP I understood that you want to throw out IRL entirely, e.g.
If we give up the assumption of human rationality—which we must—it seems we can’t say anything about the human reward function. So it seems IRL must fail.
seems like an unambiguous rejection of IRL and very different from
Our hope is that with some minimal assumptions about planner and reward we can infer the rest with enough data.
Ok, we strongly disagree on your simple constraints being enough. I’d need to see these constraints explicitly formulated before I had any confidence in them. I suspect (though I’m not certain) that the more explicit you make them, the more you’ll see how tricky it is.

And no, I don’t want to throw IRL out (this is an old post); I want to make it work. I got this big impossibility result, and now I want to get around it. This is my current plan: https://www.lesswrong.com/posts/CSEdLLEkap2pubjof/research-agenda-v0-9-synthesising-a-human-s-preferences-into
That’s a part of the disagreement. In the past you clearly thought that Occam’s razor was an “obvious” constraint that might work. Possibly you thought it was the only such constraint. Then you found this result and made a large update in the other direction. That’s why you say the result is big—rejecting a constraint that you already didn’t expect to work wouldn’t feel very significant.

On the other hand, I don’t think that Occam’s razor is the only such constraint. So when I see you reject it, I naturally ask “what about all the other obvious constraints that might work?”. To me this result reads like “0 didn’t solve our equation, therefore the solution must be very hard”. I’m sure that you have strong arguments against many other approaches, but I haven’t seen them, and I don’t think the one in the OP generalizes well.
I’d need to see these constraints explicitly formulated before I had any confidence in them.
This is a bit awkward. I’m sure that I’m not proposing anything that you haven’t already considered. And even if you show that this approach is wrong, I’d just try to put a band-aid on it. But here is an attempt:
First we’d need a data set of human behavior with both positive and negative examples (e.g. “I made a sandwich”, “I didn’t stab myself”, etc.). So it would be a set of tuples of state s, action a, and a label: +1 for positive examples, −1 for negative ones. This is not trivial to generate; in particular, it’s not clear how to pick negative examples, but here too I expect that the obvious solutions are all fine. By the way, I have no idea how the examples are formalized; that seems like a problem, but it’s not unique to this approach, so I’ll assume that it’s solved.
Next, given a pair (p, R), we would score it by adding up the following:
1. p(R) should accurately predict human behavior. So we want a count of p(R)(s)=a for positive cases and p(R)(s)!=a for negative cases.
2. R should also predict human behavior. So we want to sum R(s, a) for positive examples, minus the same sum for negative examples.
3. Regularization for p.
4. Regularization for R.
Here we are concerned about overfitting R, and don’t care about p as much, so terms 1 and 4 would get large weights, and terms 2, 3 would get smaller weights.
Finally we throw machine learning at the problem to maximize this score.
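A minimal sketch of this scoring rule, under a pile of simplifying assumptions I’m adding (finite states and actions, a tabular R, a softmax planner p parameterized only by a temperature, plain L2-style regularizers, and random search standing in for “machine learning”; none of this is pinned down above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3

# Data set: (state, action, label) tuples, with +1 for things the human did
# and -1 for things they did not do. Random stand-in data for the sketch.
data = [(rng.integers(n_states), rng.integers(n_actions), rng.choice([+1, -1]))
        for _ in range(200)]

def planner_probs(R, s, temperature):
    """A softmax planner p(R): action probabilities in state s given reward R."""
    logits = R[s] / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def score(R, temperature, w=(1.0, 0.1, 0.1, 1.0)):
    w1, w2, w3, w4 = w  # terms 1 and 4 get the large weights, as suggested above
    # 1. p(R) should predict behavior: expected count of agreements on positive
    #    examples plus disagreements on negative ones.
    term1 = sum(planner_probs(R, s, temperature)[a] if label == +1
                else 1.0 - planner_probs(R, s, temperature)[a]
                for s, a, label in data)
    # 2. R should predict behavior: reward on positives minus reward on negatives.
    term2 = sum(label * R[s, a] for s, a, label in data)
    # 3. Regularizer for p (toy choice: keep the temperature near 1).
    term3 = -(temperature - 1.0) ** 2
    # 4. Regularizer for R (plain L2 penalty).
    term4 = -np.sum(R ** 2)
    return w1 * term1 + w2 * term2 + w3 * term3 + w4 * term4

# "Throw machine learning at it" -- here, just random search over (R, temperature).
best_R, best_t = max(((rng.normal(size=(n_states, n_actions)), rng.uniform(0.1, 2.0))
                      for _ in range(2000)),
                     key=lambda pair: score(*pair))
print("best score:", round(float(score(best_R, best_t)), 2))
```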