I agree that the shutdown problem is hard. There’s a way to circumvent it, and I think that’s what will actually be pursued in the first AGI alignment attempts.
That is to make the alignment goal instruction-following. That includes following the instructions to shut down or do nothing. There’s no conflict between shutdown and the primary goal.
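To make the "no conflict" point concrete, here's a toy sketch (my illustration only, with placeholder names, not anyone's published design) of an agent loop in which a shutdown instruction is satisfied exactly the way any other instruction is:

```python
# Toy sketch: for an instruction-following agent, "shut down" is just
# another instruction to comply with, so shutdown never competes with
# the primary goal. All names are illustrative placeholders.
from typing import Callable

def instruction_following_loop(
    get_instruction: Callable[[], str],   # channel from the human operator
    carry_out: Callable[[str], None],     # actuator for ordinary tasks
) -> None:
    while True:
        instruction = get_instruction().strip().lower()
        if instruction == "shut down":
            break      # complying with shutdown is itself goal-fulfilling
        if instruction == "do nothing":
            continue   # idling on request is likewise just compliance
        carry_out(instruction)
```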
This alignment goal isn’t trivial, but it’s much simpler and therefore easier to get right than full alignment to the good-of-humanity, whatever that is.
Given the difficulties, I very much doubt that anyone is going to try launching an AGI that is aligned to the good of all humanity but can be reliably shut down if we decide we've gotten that definition wrong.
I say a little more about this in Corrigibility or DWIM is an attractive primary goal for AGI, and Roger Dearnaley goes into much more depth in his Requirements for a Basin of Attraction to Alignment.
This is more like Christiano's corrigibility concept than Eliezer's, but probably distinct from both.
I'll be writing more about this soon. Since writing that post, I've become increasingly convinced that this is the most likely alignment goal for the first, critical AGIs, and that it probably makes alignment easier while making human power dynamics concerningly relevant.
I agree that the first AGIs will probably be trained to follow instructions/DWIM. I also agree that, if you succeed in training agents to follow instructions, then you get shutdownability as a result. But I’m interested to know why you think instruction-following is much simpler and therefore easier than alignment with the good of humanity. And setting aside alignment with the good of humanity, do you think training AGIs to follow instructions will be easy in an absolute sense?
Good questions. To me, following instructions seems vastly simpler than working out what's best for all of humanity (and what counts as humanity) for an unlimited future. "Solving ethics" is often listed as a major obstacle to alignment, and I think we'll just punt on that difficult issue and align the AGI to want to follow our current instructions rather than our inmost desires, let alone all of humanity's.
I realize this isn’t fully satisfactory, so I’d like to delve into this more. It seems much simpler to guess “what did this individual mean by this request” than to guess “what does all of humanity want for all of time”. Desires are poorly defined and understood. And what counts as humanity will become quite blurry if we get the ability to create AGIs and modify humans.
WRT ease, it seems like current LLMs already understand our instructions pretty well. So any AGI that incorporates LLMs or similar linguistic training will already be in the ballpark. And being in the ballpark is all it needs to be, as long as it checks with the user before taking impactful actions.
It's important to note that in my linked post on DWIM, I include an "and check" component. It seems like pretty trivial overhead for the AGI to briefly summarize the plan it came up with and ask its human operator for approval, particularly for impactful plans.
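In code, the "and check" loop might look roughly like this sketch. Everything here is an illustrative placeholder (the planner, the impact estimate, and the approval channel), not any particular system's API:

```python
# Minimal sketch of a DWIM-and-check loop. The planner, the impact
# estimate, and the approval channel are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Plan:
    summary: str              # short description shown to the operator
    steps: List[str]          # concrete actions the agent intends to take
    estimated_impact: float   # 0.0 (trivial) to 1.0 (highly consequential)

def run_with_check(
    make_plan: Callable[[str], Plan],      # e.g. an LLM agent's planner
    execute_step: Callable[[str], None],   # actuator for individual steps
    ask_operator: Callable[[str], bool],   # True iff the human approves
    instruction: str,
    impact_threshold: float = 0.3,         # set by the operator, not the agent
) -> None:
    plan = make_plan(instruction)
    # For impactful plans, summarize and wait for explicit approval.
    if plan.estimated_impact >= impact_threshold:
        if not ask_operator(f"Proposed plan: {plan.summary}. Proceed?"):
            return  # doing nothing is always an acceptable outcome
    for step in plan.steps:
        execute_step(step)
```

The point of the sketch is just that the check is a thin wrapper around the planner, which is why the overhead seems small to me.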
WRT occasionally misunderstanding intentions, or misjudging whether an action is "impactful" enough to warrant checking before execution: there's a bunch of stuff you can do to institute internal crosschecks in an LLM agent's internal thinking. See my Internal independent review for language model agent alignment if you're interested.
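As one hedged illustration of what such a crosscheck could look like (not the specific scheme from that post), a second model instance can review each proposed action against the user's request and either pass it, escalate to the human, or veto it:

```python
# Illustrative internal crosscheck: an independent reviewer (here just a
# second callable) judges whether a proposed action matches the user's
# intent and whether it is impactful enough to need human confirmation.
from typing import Callable, Literal

Verdict = Literal["proceed", "ask_human", "veto"]

def reviewed_action(
    proposed_action: str,
    user_request: str,
    reviewer: Callable[[str], Verdict],    # hypothetical independent review model
    ask_operator: Callable[[str], bool],   # escalation channel to the human
) -> bool:
    """Return True if the action should be executed."""
    prompt = (
        f"User request: {user_request}\n"
        f"Proposed action: {proposed_action}\n"
        "Does this match the user's likely intent, and is it low-impact "
        "enough to run without confirmation? Answer: proceed / ask_human / veto."
    )
    verdict = reviewer(prompt)
    if verdict == "proceed":
        return True
    if verdict == "ask_human":
        return ask_operator(f"Reviewer is unsure about: {proposed_action}. Approve?")
    return False  # veto: err on the side of inaction
```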
But you don’t just need your AI system to understand instructions. You also need to ensure its terminal goal is to follow instructions. And that seems like the hard part.
Yes, that's a hard part. But specifying the goal accurately is often regarded as a potential failure point. So, if I'm right that this is a simpler, easier-to-specify alignment goal, that's progress. It also has the advantage of incorporating corrigibility as a byproduct, so it's resistant to partial failure: if you can tell in time that something has gone wrong, the AGI can be asked to shut down.
WRT the difficulty of making the AGI's understanding of instructions its terminal goal: I think it's not trivial, but quite doable, at least in some of the AGI architectures we can anticipate. See my two short posts Goals selected from learned knowledge: an alternative to RL alignment and The (partial) fallacy of dumb superintelligence.
Thanks, I’ll check those out.