I agree that the first AGIs will probably be trained to follow instructions/DWIM. I also agree that, if you succeed in training agents to follow instructions, then you get shutdownability as a result. But I’m interested to know why you think instruction-following is much simpler and therefore easier than alignment with the good of humanity. And setting aside alignment with the good of humanity, do you think training AGIs to follow instructions will be easy in an absolute sense?
Good questions. To me, following instructions seems vastly simpler than working out what’s best for all of humanity (and what counts as humanity) for an unlimited future. “Solving ethics” is often listed as a major obstacle to alignment, and I think we’ll just punt on that difficult issue and align the AGI to want to follow our current instructions rather than our innermost desires, let alone all of humanity’s.
I realize this isn’t fully satisfactory, so I’d like to delve into this more. It seems much simpler to guess “what did this individual mean by this request” than to guess “what does all of humanity want for all of time”. Desires are poorly defined and understood. And what counts as humanity will become quite blurry if we get the ability to create AGIs and modify humans.
WRT ease, current LLMs already understand our instructions pretty well. So any AGI that incorporates LLMs or similar linguistic training will already be in the ballpark. And in the ballpark is all it has to be, as long as it checks with the user before taking impactful actions.
Critically, the DWIM proposal in my linked post includes an “and check” component. It seems like pretty trivial overhead for the AGI to briefly summarize the plan it came up with and ask its human operator for approval, particularly for impactful plans.
WRT occasionally misunderstanding intentions, or misjudging whether an action is “impactful” enough to warrant checking first: there’s a bunch of stuff you can do to institute internal crosschecks in an LLM agent’s internal thinking. See my Internal independent review for language model agent alignment if you’re interested.
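To make the “and check” step concrete, here’s a minimal sketch of what such a loop might look like for an LLM agent. This is my illustration, not code from the linked posts: `llm_call`, `ask_operator`, and `execute` are hypothetical placeholders, and the single impact-classification query stands in for the richer internal crosschecks described there.

```python
# Minimal sketch of a DWIM "and check" loop for an LLM agent (illustrative only).

def llm_call(prompt: str) -> str:
    """Hypothetical placeholder for a call to the underlying language model."""
    raise NotImplementedError

def ask_operator(summary: str) -> bool:
    """Hypothetical placeholder: show the plan summary to the human operator, return approval."""
    raise NotImplementedError

def execute(plan: str) -> None:
    """Hypothetical placeholder for actually carrying out the plan."""
    raise NotImplementedError

def run_instruction(instruction: str) -> None:
    # 1. Draft a plan for the operator's instruction.
    plan = llm_call(f"Draft a step-by-step plan to carry out: {instruction}")

    # 2. Internal crosscheck: a separate query judges whether the plan is
    #    impactful or irreversible enough to warrant asking the operator first.
    verdict = llm_call("Answer HIGH or LOW: how impactful/irreversible is this plan?\n" + plan)

    # 3. For impactful plans, briefly summarize and ask for approval.
    if "HIGH" in verdict.upper():
        summary = llm_call("Summarize this plan in two sentences:\n" + plan)
        if not ask_operator(summary):
            return  # Operator declined; do nothing.

    # 4. Execute low-impact or approved plans.
    execute(plan)
```

The point of the sketch is just that the overhead is a couple of extra model calls per impactful plan, which is why I call it trivial.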
But you don’t just need your AI system to understand instructions. You also need to ensure its terminal goal is to follow instructions. And that seems like the hard part.
Yes, that’s a hard part. But specifying the goal accurately is often regarded as a potential failure point. So, if I’m right that this is a simpler, easier-to-specify alignment goal, that’s progress. It also has the advantage of incorporating corrigibility as a byproduct, which makes it resistant to partial failure: if you can tell in time that something has gone wrong, the AGI can be asked to shut down.
WRT the difficulty of using the AGI’s understanding of instructions as its terminal goal: I think it’s not trivial, but quite doable, at least in some of the AGI architectures we can anticipate. See my two short posts Goals selected from learned knowledge: an alternative to RL alignment and The (partial) fallacy of dumb superintelligence.
Thanks, I’ll check those out.