I wrote this in response to the ACX CHAI + Assistance Games + Fully Updated Deference post, and am linking it here so I have a good reference.
I’m a PhD student at CHAI, where Stuart is one of my advisors. Here’s some of my perspectives on the corrigibility debate and CHAI in general. That being said, I’m posting this as myself and not on behalf of CHAI: all opinions are my own and not CHAI’s or UC Berkeley’s.
I think it’s a mistake to model CHAI as a unitary agent pursuing the Assistance Games agenda (hereafter, CIRL, for its original name of Cooperative Inverse Reinforcement Learning). It’s better to think of it as a collection of researchers pursuing various research agendas related to “make AI go well”. I’d say less than half of the people at CHAI focus specifically on AGI safety/AI x-risk. Of this group, maybe 75% of them are doing something related to value learning, and maybe 25% are working on CIRL in particular. For example, I’m not currently working on CIRL, in part due to the issues that Eliezer mentions in this post.
I do think Stuart is correct that the meta reason we want corrigibility is that we want to make the human + AI system maximize CEV. That is, the fundamental reason we want the AI to shut down when we ask it is precisely because we (probably) don’t think it’s going to do what we want! I tend to think of the value uncertainty approach to the off-switch game not as an algorithm that we should implement, but as a description of why we want to do corrigibility at all. Similarly, to use Rohin Shah’s metaphor, I think of CIRL as math poetry about what we want the AI to do, but not as an algorithm that we should actually try to implement explicitly.
I also agree with both Eliezer and Stuart that meta-learning human values is easier than hardcoding them.
That being said, I’m not currently working on CIRL/assistance games and think my current position is a lot closer to Eliezer’s than Stuart’s on this topic. I broadly agree with MIRI’s two objections:
Firstly, I’m not super bullish about algorithms that rely on having good explicit probability distributions over high-dimensional reward spaces. We currently don’t have any good techniques for getting explicit probability estimates on the sort of tasks we use current state-of-the-art AIs for, besides the trollish “ask a large language model for its uncertainty estimates” solution. I think it’s very likely that we won’t have a good technique for doing explicit probability updates before we get AGI.
Secondly, even though it’s more robust than directly specifying human values directly, I think that CIRL still has a very thin “margin of error”, in that if you misspecify either the prior or the human model you can get very bad behavior. (As illustrated in the humorous red paper clips example.) I agree that a CIRL agent with almost any prior + update rule we know how to write down will quickly stop being corrigible, in the sense that the information maximizing action is not to listen to humans and to shut down when asked. To put it another way, CIRL fails when you mess up the prior or update rule of your AI. Unfortunately, not only is this very likely, this is precisely when you want your AI to listen to you and shut down!
I do think corrigibility is important, and I think more people should work on it. But I’d prefer corrigibility solutions that work when you get the AI a fair bit wrong.
I don’t think my position is that unusual at CHAI. In fact, another CHAI grad student did an informal survey and found two-thirds agree more with Eliezer’s position than Stuart’s in the debate.
I will add that I think there’s a lot of talking past each other in this debate, and in debates in AI safety in general (which is probably unavoidable given the short format). In my experience, Stuart’s actual beliefs are quite a bit more nuanced than what’s presented here. For example, he has given a lot of thought on how to keep the AI constantly corrigible and prevent it from “exploiting” its current reward function. And I do think he has updated on the capabilities of say, GPT-3. But there’s research taste differences, some ontology differences, and different beliefs about future deep learning AI progress, all of which contribute to talking past each other.
On a completely unrelated aside, many CHAI grad students really liked the kabbalistic interpretation of CHAI’s name :)
I wrote this in response to the ACX CHAI + Assistance Games + Fully Updated Deference post, and am linking it here so I have a good reference.