1. The interviewee thinks of assistance games as an _analytical tool_ that allows us to study the process by which humans convey normative information (such as goals) to an AI system. Normally, the math we write down takes the objective as given, whereas an assistance game uses math that assumes there is a human with a communication channel to the AI system. We can thus talk mathematically about how the human communicates with the AI system (see the sketch of the formalism after this list).
2. This then allows us to talk about issues that might arise. For example, <@assistive bandits@>(@The Assistive Multi-Armed Bandit@) considers the fact that humans might be learning over time (rather than starting out as optimal).
3. By using assistance games, we build an expectation directly into the math that our AI systems will have ongoing oversight and adaptation, which seems significantly better than doing this on an ad hoc basis (as is currently the case). This should help both near-term and long-term systems.
4. One core question is how we can specify a communication mechanism that is robust to misspecification. We can operationalize this as: if your AI system is missing some relevant features of the world, how bad could outcomes be? For example, it seems like demonstrating what you want (i.e. imitation learning) is more robust than directly saying what the goal is (see the toy example after this list).
5. One piece of advice for deep learning practitioners is to think about where the normative information for your AI system is coming from, and whether it is sufficient to convey what you want. For example, large language models have trillions of parameters, but only hundreds of decisions go into choosing what data to train them on—is that enough? The text they are trained on has lots of normative content—does that compensate?
6. Dylan says: “if you’re interested in doing this type of work and you thought this conversation was fun and you’d like to have more conversations like it with me, I’ll invite you to [apply to MIT’s EECS PhD program](https://gradapply.mit.edu/eecs/apply/login/?next=/eecs/) next year and mention me in your application.”
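As a quick gloss on item 1 (this formalism is not spelled out in the podcast; it is roughly the cooperative inverse RL setup from Dylan's earlier work, with my own notation): an assistance game is a two-player cooperative game in which the human knows the reward parameter and the AI system does not.

```latex
% Sketch of a standard assistance-game tuple (roughly cooperative IRL); notation is illustrative.
\[
  \mathcal{M} = \big\langle\, S,\ \{A^{H}, A^{R}\},\ T(s' \mid s, a^{H}, a^{R}),\ \Theta,\ R(s, a^{H}, a^{R}; \theta),\ P_0,\ \gamma \,\big\rangle
\]
% Both players maximize the same discounted sum of R, but the parameter theta ~ P_0 is
% observed only by the human, so the robot can only learn what to optimize by observing
% the human's behavior: this is the "communication channel" mentioned in item 1.
```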
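To illustrate the robustness intuition in item 4, here is a toy example (the features, weights, and action names are all invented for illustration): the designer writes down a reward that omits a feature the human cares about, and we compare directly optimizing that proxy against imitating a demonstrator who acts on the true reward.

```python
import numpy as np

# Toy illustration of item 4 (all numbers invented): outcomes are described by two
# features, but the reward the designer wrote down is missing the second one.
# Feature vector per action: [task_progress, side_effect_avoidance]
actions = {
    "fast_and_reckless": np.array([1.0, 0.0]),
    "balanced":          np.array([0.8, 0.8]),
    "slow_and_careful":  np.array([0.5, 1.0]),
}

true_weights  = np.array([1.0, 2.0])   # what the human actually cares about
proxy_weights = np.array([1.0, 0.0])   # the specified reward: second feature omitted

def best_action(weights):
    """Return the action maximizing a linear reward under the given weights."""
    return max(actions, key=lambda a: float(weights @ actions[a]))

# Directly specifying the goal: the AI optimizes the (misspecified) proxy reward.
proxy_choice = best_action(proxy_weights)

# Imitation: the AI copies a human demonstrator, who acts on the true reward, so the
# missing feature still shapes behavior even though it was never written down.
demo_choice = best_action(true_weights)

for label, choice in [("optimize proxy", proxy_choice), ("imitate human", demo_choice)]:
    true_value = float(true_weights @ actions[choice])
    print(f"{label:15s} -> {choice:18s} (true reward = {true_value:.2f})")
```

Of course, imitation has its own limitations (you generally can't do better than the demonstrator), but the point is that demonstrations carry information about features the designer never wrote down.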
Planned opinion:
I’m a big fan of thinking about how normative information is transferred from us to our agents. I frequently ask myself questions like “how does the agent get the information needed to know X”, where X is something normative like “wireheading is bad”.
In the case of large neural nets, I generally like assistance games as an analysis tool for thinking about how such AI systems should behave at deployment time, for the reasons outlined in the podcast. It’s less clear what the framework has to say about what should be done at training time, when we don’t expect to have a human in the loop (or expect human feedback to make up only a small fraction of the training data).
To be clear, this should be taken as an endorsement of thinking about assistance games: my point is just that (according to me) it is best to think of them in relation to deployment, not training. A framework doesn’t have to apply to everything in order to be well worth thinking about.