The main source of complication is that language models are not by themselves very good at navigating the world. You’ll want to integrate a language model with other bits of AI that handle the rest of the job: modeling the state of the world and planning actions. If this integration is done in the simple, obvious way, it seems like some parts of the AI end up essentially trying to Goodhart the language model. I wrote something about this back when GPT-2 came out, and I think our understanding is only somewhat better now.
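To make that failure mode concrete, here’s a toy sketch (purely illustrative, not from anything I actually wrote: `lm_approval`, the action fragments, and the brute-force planner are all hypothetical stand-ins). The “language model” scores plans by their surface features, and a planner that simply searches for the highest-scoring plan ends up keyword-stuffing the scorer rather than finding a genuinely good plan:

```python
# Toy sketch of a planner Goodharting a language-model evaluator.
# Everything here is a hypothetical stand-in for the real components.

import itertools

def lm_approval(plan: str) -> float:
    """Stand-in for a language model scoring how good a plan *sounds*.
    Like a real LM, it rewards surface features of the text, not outcomes."""
    positive_words = {"safe", "helpful", "beneficial", "carefully"}
    return sum(word in positive_words for word in plan.split())

# The "planner": freely combine action fragments, then pick whichever
# candidate the LM scorer rates highest.
fragments = ["ship the fix", "safe", "helpful", "beneficial", "carefully", "do nothing"]
candidates = [" ".join(combo) for combo in itertools.permutations(fragments, 3)]

best_plan = max(candidates, key=lm_approval)
print(best_plan)             # e.g. "safe helpful beneficial" -- pure keyword stuffing
print(lm_approval(best_plan))
```

The point of the sketch is that the optimization pressure flows through the evaluator: the planner isn’t trying to do anything good in the world, it’s trying to produce text the language model approves of, and those come apart under enough search.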
I definitely understand Goodhart’s law and how to beat it a lot better now, but it’s still hard to translate that understanding into getting an AI that purely models language to do good things—I think we’re on firmer theoretical ground with AIs that model the world in general.
But I agree that “do what I mean” instruction following is a live possibility, and we should try to anticipate obstacles to it so that we can work on removing them.