I think I just completely agree with this and am confused what either Buck or I said that was in opposition to this view.
As a particular example, I don’t think what you said is in opposition to “we must make assumptions to bridge the is-ought gap”.
Like, you also can’t deduce that the sun is going to rise tomorrow—for that, you need to assume that the future will be like the past, or that the world is “simple”, or something like that (see the Problem of Induction). Nonetheless, I feel fine with having some assumption of that form, and baking such an assumption into an AI system. Similarly, just because I say “there needs to be some assumption that bridges the is-ought gap” doesn’t mean that everything is hopeless—there can be an assumption that we find very reasonable that we’re happy to bake into the AI system.
Here’s a stab at the assumption you’re making:
Suppose that “human standards” are a function f that takes in a reasoner r and quantifies how good r is at moral reasoning. (This is an “is” fact about the world.) Then, if we select r* = argmax_r f(r), the things that r* claims are “ought” facts actually are “ought” facts.
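To make the shape of that assumption concrete, here’s a minimal toy sketch (my own illustration, not something from the podcast). The selection step uses only “is” facts (scores under human standards); the last step, where we treat the selected reasoner’s “ought” claims as actual “ought” facts, is exactly where the extra assumption gets baked in. All names here (`human_standards`, `ought_claims`, `candidate_reasoners`) are hypothetical placeholders.

```python
# Toy sketch of the assumption above; every name is a hypothetical placeholder.

def human_standards(reasoner) -> float:
    """'Is' fact: a score for how good `reasoner` is at moral reasoning,
    as judged by human standards (stand-in for some empirical evaluation)."""
    return reasoner["score"]

def ought_claims(reasoner) -> list[str]:
    """The 'ought' claims that `reasoner` makes."""
    return reasoner["claims"]

candidate_reasoners = [
    {"score": 0.3, "claims": ["ought-claim A"]},
    {"score": 0.9, "claims": ["ought-claim B"]},
]

# Selection step: purely an 'is' computation, r* = argmax_r f(r).
r_star = max(candidate_reasoners, key=human_standards)

# The bridging assumption lives here: we *treat* r_star's "ought" claims as
# actual "ought" facts. Nothing in the code forces this step; it is the
# assumption we choose to bake into the system.
accepted_ought_facts = ought_claims(r_star)
print(accepted_ought_facts)
```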
This is similar to, though not the same as, what Buck said:
your system will do an amazing job of answering “is” questions about what humans would say about “ought” questions. And so I guess maybe you could phrase the second part as: to get your system to do things that match human preferences, use the fact that it knows how to make accurate “is” statements about humans’ ought statements?
And my followup a bit later:
As Buck said, that lets you predict what humans would say about ought statements, which your assumption could then be, whatever humans say about ought statements, that’s what you ought to do. And that’s still an assumption. Maybe it’s a very reasonable assumption that we’re happy to put it into our AI system.
Also, re
But I think it’s because common sense is wrong here.
The common sense view is that humans have some set of true values hidden away somewhere, and that superhuman moral reasoning means doing a better job than humans at adhering to the true values. It just seems super obvious to our common sense that there is some fact of the matter about which patterns in human behavior are the true human values, and which are quirks of the decision-making process.
This seems like the same thing Buck is saying here:
Buck Shlegeris: I think I want to object to a little bit of your framing there. My stance on utility functions of humans isn’t that there are a bunch of complicated subtleties on top, it’s that modeling humans with utility functions is just a really sad state to be in. If your alignment strategy involves positing that humans behave as expected utility maximizers, I am very pessimistic about it working in the short term, and I just think that we should be trying to completely avoid anything which does that. It’s not like there’s a bunch of complicated sub-problems that we need to work out about how to describe us as expected utility maximizers, my best guess is that we would just not end up doing that because it’s not a good idea.
Tbc, I agree with Buck’s framing here, though maybe not as confidently—it seems plausible though unlikely (~10%?) to me that approximating humans as EU maximizers would turn out okay, even though it isn’t literally true.
Yeah, on going back and reading the transcript, I think I was just misinterpreting what you were talking about. I agree with what you were trying to illustrate to Lucas.
(Though I still don’t think the is-ought distinction is a perfect analogy. This manifests in me also not thinking that the notion of “ought facts” would make it into a later draft of your assumption example—though on the other hand, talking about ought facts makes more sense when you’re trying to find a utility-function-type object, which [shock] is what you were talking about at that point in the podcast.)