To build aligned AI systems, we need our AI systems to learn what to do from human feedback. However, it is unclear how to interpret that feedback: any particular piece of feedback could be wrong; economics provides many examples of stated preferences diverging from revealed preferences. Not only would we like our AI system to be uncertain about the interpretation of any particular piece of feedback, we would also like it to _improve_ its process for interpreting human feedback. This would come from human feedback on the meta-level process by which the AI system learns. This gives us _process-level feedback_, where we make sure the AI system gets the right answers _for the right reasons_.
Of course, if we learn _how_ to interpret human feedback, that too is going to be uncertain. We can address this by “going meta” once again: learning how to learn to interpret human feedback. Iterating this process, we get an infinite tower of “levels” of learning, where at every level we assume that neither the feedback nor the loss function we are using is perfect.
For this to actually be feasible, we clearly need to share information across these various “levels” (or else it would take infinite time to learn at all of them). The AI system should not just learn to decrease the probability assigned to a single hypothesis; it should learn what _kinds_ of hypotheses tend to be good or bad, as in the sketch below.
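To make the “sharing across levels” idea concrete, here is a toy sketch (my own illustration, not the actual algorithm from the posts) in which each hypothesis is described by features, and feedback updates weights over those features, so a single piece of feedback moves probability on every hypothesis that shares them:

```python
import numpy as np

# Toy illustration: hypothetical features describing each hypothesis, e.g.
# [uses_stated_preferences, uses_revealed_preferences]. All numbers are made up.
hypotheses = {
    "h1": np.array([1.0, 0.0]),
    "h2": np.array([1.0, 0.0]),
    "h3": np.array([0.0, 1.0]),
}
feature_weights = np.zeros(2)  # log-weights over hypothesis *features*

def hypothesis_probs(weights):
    """Softmax over feature-based scores for each hypothesis."""
    scores = np.array([feats @ weights for feats in hypotheses.values()])
    probs = np.exp(scores - scores.max())
    return dict(zip(hypotheses, probs / probs.sum()))

# Negative feedback on h1 alone: instead of only downweighting h1, we
# downweight the features h1 exemplifies, which also moves h2.
learning_rate = 1.0
feature_weights -= learning_rate * hypotheses["h1"]
print(hypothesis_probs(feature_weights))
# h3 now gets most of the probability mass, even though the feedback
# never mentioned it: information was shared across hypotheses.
```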
**On feedback types:** The scheme introduced here seems to rely quite strongly on the ability of humans to give good process-level feedback _at arbitrarily high levels_. It is not clear to me that humans can do this: when thinking at the meta level, humans often fail to notice important considerations that would be obvious in an object-level case. I think this could be a significant barrier to the scheme, though it’s hard to say without more concrete examples of what it looks like in practice.
**On interaction:** I’ve previously <@argued@>(@Human-AI Interaction@) that it is important to get feedback _online_ from the human; giving feedback “all at once” at the beginning is too hard to do well. However, the idealized algorithm here does have the feedback “all at once”. It’s possible that this is okay, if it is primarily process-level feedback, but it seems fairly worrying to me.
**On desiderata:** The desiderata introduced in the first post feel stronger than they need to be. It seems possible to specify a method of interpreting feedback that is _good enough_: it doesn’t capture everything exactly, but it is sufficiently correct that it leads to good outcomes. This seems especially true when talking about process-level feedback, or feedback one meta level up: as long as the AI system has learned an okay notion of “being helpful” or “being corrigible”, then it seems like we’re probably fine.
Often just making feedback uncertain can help. For example, in the preference learning literature, Boltzmann rationality has emerged as the model of choice for how to interpret human feedback. While there are several theoretical justifications for this model, I suspect it succeeds simply because it makes feedback uncertain: if you want a model that assigns higher likelihood to high-reward actions, but still assigns some probability to all actions, it seems like you end up choosing the Boltzmann model (or something functionally equivalent). Note that there is work trying to improve upon this model, such as by [modeling humans as pedagogic](https://papers.nips.cc/paper/2016/file/b5488aeff42889188d03c9895255cecc-Paper.pdf), or by <@incorporating a notion of similarity@>(@LESS is More: Rethinking Probabilistic Models of Human Behavior@).
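As a concrete illustration (my own sketch, not taken from the posts or from any particular paper), here is what Bayesian reward inference with a Boltzmann-rational feedback model looks like; the candidate reward hypotheses, the rationality coefficient `beta`, and the observed choice are all made up for the example:

```python
import numpy as np

# Candidate reward hypotheses, indexed by action [a0, a1, a2]. Made-up values.
candidate_rewards = {
    "theta_A": np.array([1.0, 0.0, 0.0]),   # hypothesis: a0 is best
    "theta_B": np.array([0.0, 1.0, 0.0]),   # hypothesis: a1 is best
}
beta = 2.0  # rationality coefficient: higher = human assumed more reliable

def boltzmann_likelihood(rewards, chosen_idx, beta):
    """P(human chooses an action | reward) is proportional to exp(beta * reward)."""
    logits = beta * rewards
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs[chosen_idx]

# Observe the human choose a1; update a uniform prior over reward hypotheses.
posterior = {
    name: boltzmann_likelihood(rewards, chosen_idx=1, beta=beta)
    for name, rewards in candidate_rewards.items()
}
total = sum(posterior.values())
posterior = {name: p / total for name, p in posterior.items()}
print(posterior)
# Every hypothesis keeps nonzero probability: the feedback is treated as
# evidence about the reward, not as ground truth.
```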
So overall, I don’t feel convinced that we need to aim for learning at all levels. That being said, the second post introduced a different argument: that the method does as well as we “could” do given the limits of human reasoning. I like this a lot more as a desideratum; it feels more achievable and more tied to what we care about.