Ok, that makes the real incentives quite different. Then I suspect that these people are navigating Facebook using the intuitions and strategies from the real world, without much consideration for the new digital environment.
Yes, and you answered that question well. But the reason I asked for alternative responses was so that I could compare them to unsolicited recommendations from the anime-fan’s point of view (and find that unsolicited recommendations have lower effort or higher reward).
Also, I’m not asking “How did your friend want the world to be different?”, I’m asking “What action could your friend have taken to avoid that particular response?”. The friend is a rational agent; he is able to consider alternative strategies, but he shouldn’t expect other people to change their behavior when they have no personal incentive to do so.
What is the domain of U? What inputs does it take? In your papers you take a generic Markov Decision Process, but which one will you use here? How exactly do you model the real world? What is the set of states and the set of actions? Does the set of states include the internal state of the AI?
You may have been referring to this as “4. Issues of ontology”, but I don’t think the problem can be separated from your agenda. I don’t see how any progress can be made without answering these questions. Maybe you can start with naive answers and move on to something more realistic later. If so, I’m interested in what those naive world models look like. And I’m suspicious of how well human preferences would translate onto such models.
Other AI construction methods could claim that the AI will learn the optimal world model by interacting with the world, but I don’t think this solution can work for your agenda, since the U function is fixed from the start.
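To make that concrete, here is roughly the kind of thing I picture when I say “naive world model”. Everything in this sketch is my own invention for illustration (the states, the actions, and the choice to leave the AI’s internal state out of the state space), not something from your papers:

```python
from typing import Dict, Tuple

# A toy MDP as a naive world model: a handful of states, a couple of
# actions, deterministic transitions, and U defined directly on states.
STATES = ["human_fed", "human_hungry", "human_absent"]
ACTIONS = ["make_sandwich", "wait"]

# Deterministic transition table: (state, action) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("human_hungry", "make_sandwich"): "human_fed",
    ("human_hungry", "wait"):          "human_hungry",
    ("human_fed",    "make_sandwich"): "human_fed",
    ("human_fed",    "wait"):          "human_hungry",
    ("human_absent", "make_sandwich"): "human_absent",
    ("human_absent", "wait"):          "human_absent",
}

# The domain of U is this tiny state space; how well human preferences
# about the real world translate onto states this coarse is exactly
# what I'm suspicious of.
U: Dict[str, float] = {"human_fed": 1.0, "human_hungry": 0.0, "human_absent": 0.0}
```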
Discounting. There is no law of nature that can force me to care about preventing human extinction years from now more than about eating a tasty sandwich tomorrow. There is also no law that can force me to care about human extinction much more than about my own death.
There are, of course, more technical disagreements to be had. Reasonable people could question how bad unaligned AI will be or how much progress is possible in this research. But unlike those questions, the reasons for discounting are not debatable.
I do things my way because I want to display my independence (not doing what others tell me) and intelligence (ability to come up with novel solutions), and because I would feel bored otherwise (this is a feature of how my brain works, I can’t help it).
“I feel independent and intelligent”, “other people see me as independent and intelligent”, “I feel bored” are all perfectly regular outcomes. They can be either terminal or instrumental goals. Either way, I disagree that these cases somehow don’t fit in the usual preference model. You’re only having this problem because you’re interpreting “outcome” in a very narrow way.
Yes. The latter seems to be what OP is asking about: “If one wanted it to not happen, how would one go about that?”. I assume OP is taking the perspective of his friend, who is annoyed by this behavior, rather than the perspective of the anime-fans, who don’t necessarily see anything wrong with the situation.
That sounds reasonable, but the proper thing is not usually the easy thing, and you’re not going to make people do the proper thing just by saying that it is proper.
If we want to talk about this as a problem in rationality, we should probably talk about social incentives, and possible alternative strategies for the anime-hater (you’re now talking about a better strategy for the anime-fan, but it’s not good to ask other people to solve your problems). Although I’m not sure to what extent this is a problem that needs solving.
And then the other person says “no thanks”, and you both stand in awkward silence? My point is that offering recommendations is a natural thing to say, even if not perfect, and it’s nice to have something to say. If you want to discourage unsolicited recommendations, then you need to propose a different trajectory for the conversation. Changing topic is hard, and simply going away is rude. People give unsolicited recommendations because it seems to be the best option available.
Sure, but it remains unclear what response the friend wanted from the other person. What better options are there? Should they just go away? Change topic? I’m looking for specific answers here.
a friend of mine observed that he couldn’t talk about how he didn’t like anime without a bunch of people rushing in to tell him that anime was actually good and recommending anime for him to watch
What response did your friend want? The reaction seems very natural to me (especially from anime fans). Note that your friend has at some point tried watching anime, and he has now chosen to talk about anime, which could easily mean that on some level he wants to like anime, or at least understand why others like it.
I got this big impossibility result
That’s a part of the disagreement. In the past you clearly thought that Occam’s razor was an “obvious” constraint that might work. Possibly you thought it was a unique such constraint. Then you found this result, and made a large update in the other direction. That’s why you say the result is big—rejecting a constraint that you already didn’t expect to work wouldn’t feel very significant.
On the other hand, I don’t think that Occam’s razor is a unique such constraint. So when I see you reject it, I naturally ask “what about all the other obvious constraints that might work?”. To me this result reads like “0 didn’t solve our equation, therefore the solution must be very hard”. I’m sure that you have strong arguments against many other approaches, but I haven’t seen them, and I don’t think the one in OP generalizes well.
I’d need to see these constraints explicitly formulated before I had any confidence in them.
This is a bit awkward. I’m sure that I’m not proposing anything that you haven’t already considered. And even if you show that this approach is wrong, I’d just try to put a band-aid on it. But here is an attempt:
First we’d need a data set of human behavior with both positive and negative examples (e.g. “I made a sandwich”, “I didn’t stab myself”, etc.). So it would be a set of tuples of a state s, an action a, and +1 for positive examples, −1 for negative ones. This is not trivial to generate, especially since it’s not clear how to pick negative examples, but here too I expect that the obvious solutions are all fine. By the way, I have no idea how the examples are formalized, which seems like a problem, but it’s not unique to this approach, so I’ll assume that it’s solved.
Next, given a pair (p, R), we would score it by adding up the following:
1. p(R) should accurately predict human behavior. So we want a count of p(R)(s)=a for positive cases and p(R)(s)!=a for negative cases.
2. R should also predict human behavior. So we want to sum R(s, a) for positive examples, minus the same sum for negative examples.
3. Regularization for p.
4. Regularization for R.
Here we are concerned about overfitting R, and don’t care about p as much, so terms 1 and 4 would get large weights, and terms 2, 3 would get smaller weights.
Finally we throw machine learning at the problem to maximize this score.
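To make the terms concrete, here is a rough sketch of that score as a function (the signatures, weights and regularizers are illustrative assumptions; “throwing machine learning at it” would then mean searching or gradient-ascending over parametrized p and R to maximize this):

```python
def score(p, R, examples, reg_p, reg_R, w1=1.0, w2=0.1, w3=0.1, w4=1.0):
    """Score a candidate (planner p, reward R) pair against labelled behavior.

    examples: iterable of (state, action, label) tuples, where label is +1
              for observed human behavior and -1 for negative examples.
    reg_p, reg_R: callables returning a complexity penalty for p and for R
                  (parameter norm, description length, whatever we pick).
    """
    term1 = 0.0  # 1. the policy p(R) should reproduce the labelled behavior
    term2 = 0.0  # 2. R itself should rank positive examples above negative ones
    for state, action, label in examples:
        predicted = p(R)(state)  # the action the planner would take in this state
        if label > 0:
            term1 += (predicted == action)
            term2 += R(state, action)
        else:
            term1 += (predicted != action)
            term2 -= R(state, action)
    # 3. and 4. are the regularizers; the large w1 and w4 encode caring most
    # about p(R)'s accuracy and about not overfitting R.
    return w1 * term1 + w2 * term2 - w3 * reg_p(p) - w4 * reg_R(R)
```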
So it seems that there was progress in applied rationality and in AI. But that’s far from everything LW has talked about. What about more theoretical topics, general problems in philosophy, morality, etc.? Do you feel that discussing some topics resulted in no progress and was a waste of time?
There’s some debate about which things are “improvements” as opposed to changes.
Important question. Does the debate actually exist, or is this a figure of speech?
1 is trivial, so yes. But I don’t agree with 2. Maybe the disagreement comes from “few” and “obvious”? To be clear, I count evaluating some simple statistic on a large data set as one constraint. I’m not so sure about “obvious”. It’s not yet clear to me that my simple constraints aren’t good enough. But if you say that more complex constraints would give us a lot more confidence, that’s reasonable.
From OP I understood that you want to throw out IRL entirely, e.g.
If we give up the assumption of human rationality—which we must—it seems we can’t say anything about the human reward function. So it seems IRL must fail.
seems like an unambiguous rejection of IRL and very different from
Our hope is that with some minimal assumptions about planner and reward we can infer the rest with enough data.
But it’s not like there are just these five preferences and once we have four of them out of the way, we’re done.
My example test is not nearly as specific as you imply. It discards large swaths of harmful and useless reward functions. Additional test cases would restrict the space further. There are still harmful Rs in the remaining space, but their proportion must be much lower than in the beginning. Is that not good enough?
What you’re seeing as “adding enough clear examples” is actually “hand-crafting R(0) in totality”.
Are you saying that R can’t generalize if trained on a reasonably sized data set? This is very significant, if true, but I don’t see it.
For more details see here: https://arxiv.org/abs/1712.05812
Details are good. I have a few notes though.
true decomposition
This might be a nitpick, but there is no such thing. If the agent was not originally composed from p and R, then none of the decompositions are “true”. There are only “useful” decompositions. But that itself requires many assumptions about how usefulness is measured. I’m confused about how much of a problem this is. But it might be a big part of our philosophical difference—I want to slap together some ad hoc stuff that possibly works, while you want to find something true.
The high complexity of the genuine human reward function
In this section you show that the pair (p(0), R(0)) is high complexity, but it seems that p(0) could be complex while R(0) is relatively simple, contrary to what the title suggests. We don’t actually need to find p(0); finding R(0) should be good enough.
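To spell that out (assuming complexity here means something like description length): K(p(0), R(0)) ≤ K(p(0)) + K(R(0)) + O(1), so a high-complexity pair only tells us that the components’ complexities can’t both be small; it’s entirely compatible with K(R(0)) being small while K(p(0)) carries almost all of the complexity.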
Our hope is that with some minimal assumptions about planner and reward we can infer the rest with enough data.
Huh, isn’t that what I’m saying? Is the problem that the assumptions I mentioned are derived from observing the human?
Slight tangent: I realized that the major difference between a human and the agent H (from the first example in OP) is that the human can take complex inputs. In particular, a human can take logical propositions about themselves or about a desirable R(0) and approve or disapprove of them. I’m not saying that “find an R(0) that a human would approve of” is a good algorithm, but something along those lines could be useful.
This is true, but it doesn’t fit well with the given example of “When will [country] develop the nuclear bomb?”. The problem isn’t that people can’t agree what “nuclear bomb” means or who already has them. The problem is that people are working from different priors and extrapolating them in different ways.
Are you going to state your beliefs? I’m asking because I’m not sure what that looks like. My concern is that the statement will be very vague or very long and complex. Either way, you will have a lot of freedom to argue that actually your actions do match your statements, regardless of what those actions are. Then the statement would not be useful.
Instead I suggest that you should be accountable to people who share your beliefs. Having someone who disagrees with you try to model your beliefs and check your actions against that model seems like a source of conflict. Of course, stating your beliefs can be helpful in recognizing these people (but it is not the only method).
What’s the motivation? In what case is lower accuracy for higher consistency a reasonable trade-off? Consistency over time, especially, sounds like something that would discourage updating on new evidence.
Evaluating R on a single example of human behavior is good enough to reject R(2), R(4) and possibly R(3).
Example: this morning I went to the kitchen and picked up a knife. Among possible further actions, I had A—“make a sandwich” and B—“stab myself in the gut”. I chose A. R(2) and R(4) say I wanted B and R(3) is indifferent. I think that’s enough reason to discard them.
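A minimal sketch of that filter, with invented reward functions standing in for the candidates (these are illustrative stand-ins, not the actual decompositions from the paper):

```python
# Invented state and action names for the kitchen example.
STATE = "in_kitchen_holding_knife"
A, B = "make_sandwich", "stab_self_in_gut"

def R1(s, a): return 1.0 if a == A else 0.0   # prefers the observed action
def R2(s, a): return 1.0 if a == B else 0.0   # says I wanted B
def R3(s, a): return 0.0                      # indifferent between A and B

def passes_test(R, observed=A, rejected=B, s=STATE):
    # Keep R only if it strictly prefers the action actually taken.
    # Relaxing > to >= would let the indifferent R3 survive, hence "possibly R(3)".
    return R(s, observed) > R(s, rejected)

candidates = {"R1": R1, "R2": R2, "R3": R3}
survivors = [name for name, R in candidates.items() if passes_test(R)]
print(survivors)  # ['R1'] -- R2 is discarded, and R3 as well under the strict test
```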
Why not do this? Do you not agree that this test discards dangerous R more often than useful R? My guess is that you’re asking for very strong formal guarantees from the assumptions you consider, and using a narrow interpretation of what it means to “make IRL work”.
The point isn’t that there is nothing wrong or dangerous about learning biases and rewards. The point is that the OP is not very relevant to those concerns. The OP says that learning can’t be done without extra assumptions, but we have plenty of natural assumptions to choose from. The fact that assumptions are needed is interesting, but it is by no means a strong argument against IRL.
What if in reality due to effects currently beyond our understanding, our actions are making the future more likely to be dystopian in some way than if we took random actions?
That’s an interesting question, because we obviously are taking actions that make the future more likely to be dystopian—we’re trying to develop AGI, which might turn out unfriendly.
While it’s true that preferences are not immutable, the things that change them don’t usually include debate. Sure, some people can be made to believe that their preferences are inconsistent, but then they will only make the smallest correction needed to fix the problem. Also, sometimes debate will make someone claim to have changed their preferences, just so that they can avoid social pressure (e.g. “how dare you not care about starving children!”), but this may not be reflected in their actions.
Regardless, my claim is that many (or most) people discount a lot, and that this would be stable under reflection. Otherwise we’d see more charity, more investment and more work on e.g. climate change.