It seems like those goals are all in the training set, because humans talk about those concepts. Corrigibility, for example, is an elaboration of “make sure you keep doing what these people say”.
It seems like you could simply use an LLM’s knowledge of concepts to define alignment goals, at least to a first approximation. I review one such proposal here. There are still important questions about how well that knowledge generalizes, both with continued learning and to out-of-distribution (OOD) future contexts. But almost no one is talking about those questions. Many are still saying “we have no idea how to define human values”, when LLMs can capture much of any definition you like.