The discussion of corrigibility, which begins with very simple programs like Return_Zeros and gradually builds up complexity through Return_Input, Run_Python_Script and beyond, is interesting. It helps make clear that corrigibility isn’t a particularly narrow or challenging target for software in general, or even for some more intelligent systems. It’s specifically when a program starts to become a powerful optimizer or to take on more agentic qualities that maintaining corrigibility starts to seem really difficult and unclear.
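As a rough illustration of why the simplest programs in that progression pose no corrigibility problem, here is a minimal sketch. The function names mirror the ones mentioned above, but the bodies are my own guesses at what such programs might look like, not the post's definitions; in particular, the `result` variable convention in `run_python_script` is an assumption made purely for this example.

```python
def return_zeros(n):
    """Ignore everything but a length and return that many zeros.

    Hypothetical stand-in for Return_Zeros: it has no goals and no model
    of its operator, so there is nothing in it that could resist correction.
    """
    return [0] * n


def return_input(x):
    """Hypothetical stand-in for Return_Input: echo the input unchanged."""
    return x


def run_python_script(source):
    """Hypothetical stand-in for Run_Python_Script.

    Executes the given source in a fresh namespace and returns whatever the
    script stored in a variable named `result` (an assumed convention).
    The program does exactly what the operator's script says and nothing
    more; any optimization would have to come from the script itself.
    """
    namespace = {}
    exec(source, namespace)
    return namespace.get("result")
```

None of these functions represents or pursues an objective, so changing or shutting down any of them never conflicts with anything the program is "trying" to do. The difficulty only shows up once the program is itself doing the optimizing.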
Post somewhat inspired by Eliezer’s “well why didn’t anyone else write it then?”
For posterity, or for anyone who doesn’t know which post from Eliezer this is referring to, it’s “Let’s See You Write That Corrigibility Tag”.