Yep, that’s the thing that I think is interesting—note that none of the HHH RLHF was intended to make the model more agentic.
[...]
I think one of the best ways we can try to use these sorts of models safely is by trying to get them not to simulate a very agentic character, and this is evidence that current approaches are very much not doing that.
I want to flag that I don’t know what it would mean for a “helpful, honest, harmless” assistant / character to be non-agentic (or for it to be no more agentic than the pretrained LM it was initialized from).
EDIT: this applies to the OP and the originating comment in this thread, in addition to the parent comment.
Agreed. To be consistently “helpful, honest, and harmless”, an LLM should somehow “keep this in the back of its mind” when it assists the person, or else it risks violating these desiderata.
In DNN LLMs, “keeping something in the back of the mind” is equivalent to activating the corresponding feature (of “HHH assistant”, in this case) during most inferences, which is equivalent to self-awareness, self-evidencing, goal-directedness, and agency in a narrow sense (these are all synonyms). See my reply to nostalgebraist for more details.