Planned summary for the Alignment Newsletter:
We could try creating AI systems that take the “artificial intentional stance” towards humans: that is, they model humans as agents that are trying to achieve some goals, and then we get the AI system to optimize for those inferred goals. We could do this by training an agent that jointly models the world and understands natural language, in order to ground the language in actual states of the world. The hope is that with this scheme, as the agent becomes more capable, its understanding of what we want improves as well, so that it is robust to scaling up. However, the scheme has no protection against Goodharting, and doesn’t obviously address metaethics.
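To make the scheme concrete, here is a minimal sketch of the intentional-stance loop in a toy setting: the AI observes a human acting in a small gridworld, does Bayesian inference over which candidate goal the human is pursuing (assuming a Boltzmann-rational human model), and would then optimize for the inferred goal itself. The gridworld, the candidate goals, and the rationality model are all illustrative assumptions on my part, not details from the post.

```python
import numpy as np

GRID = 5                                     # 5x5 gridworld (toy assumption)
GOALS = [(0, 4), (4, 4), (4, 0)]             # candidate goals the human might have
MOVES = [(0, 1), (0, -1), (1, 0), (-1, 0)]   # right, left, down, up

def step(state, move):
    """Apply a move, clipping at the grid boundary."""
    r, c = state
    dr, dc = move
    return (min(max(r + dr, 0), GRID - 1), min(max(c + dc, 0), GRID - 1))

def action_log_probs(state, goal, beta=3.0):
    """Boltzmann-rational human model: prefer moves that reduce Manhattan distance to the goal."""
    dists = np.array([abs(step(state, m)[0] - goal[0]) + abs(step(state, m)[1] - goal[1])
                      for m in MOVES], dtype=float)
    logits = -beta * dists
    return logits - np.log(np.exp(logits).sum())

def infer_goal(trajectory):
    """Bayesian goal inference from observed (state, move) pairs, with a uniform prior over GOALS."""
    log_post = np.zeros(len(GOALS))
    for state, move in trajectory:
        idx = MOVES.index(move)
        for g, goal in enumerate(GOALS):
            log_post[g] += action_log_probs(state, goal)[idx]
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Observed human trajectory: moving right, then up, from the middle of the grid.
human_traj = [((2, 2), (0, 1)), ((2, 3), (0, 1)), ((2, 4), (-1, 0))]
posterior = infer_goal(human_traj)
inferred_goal = GOALS[int(np.argmax(posterior))]
print("posterior over candidate goals:", dict(zip(GOALS, posterior.round(3))))
print("the AI would now plan/optimize toward:", inferred_goal)
```

This sketch also makes the Goodharting worry concrete: if the human model or the goal hypothesis space is wrong, the AI optimizes hard for a misinferred goal.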
Planned opinion:
I agree with the general spirit of “get the AI system to understand common sense; then give it instructions that it interprets correctly”. I usually expect future ML research to figure out the common sense part, so I don’t focus on particular implementations (in this case, simultaneous training on vision and natural language) and instead just assume we’ll have that capability somehow. The hard part is then how to leverage that capability to provide _correctly interpreted_ instructions. It may be as simple as providing instructions in natural language, as this post suggests. I’m much less worried about instrumental subgoals in such a scenario, since part of “understanding what we mean” includes “and don’t pursue this instruction literally to extremes”. But we still need to figure out how to translate natural language instructions into actions.