But clearly LLM pretraining doesn't teach all human values; if it did, RLHF finetuning wouldn't be required at all.
The simulators framing suggests that RLHF doesn't teach an LLM anything new. Instead, the LLM can already instantiate many agents/simulacra, but does so haphazardly; RLHF picks out particular simulacra and stabilizes them, anchoring them to their dialogue handles. On this view, an LLM could well contain all human values, but isn't good enough to channel stable simulacra that exhibit them, and RLHF stabilizes a few simulacra that do so in ways useful for their roles in a bureaucracy.
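To make the "haphazard instantiation" point concrete, here is a minimal sketch (my illustration, not from the original; it assumes the Hugging Face `transformers` library, the small `gpt2` checkpoint as a stand-in base model, and a hypothetical `Human:`/`Assistant:` dialogue handle): sampling the same handle repeatedly from a base model tends to channel a different simulacrum each time, whereas an RLHF-tuned chat model run the same way would keep returning the same stabilized persona.

```python
# Toy illustration (an assumption-laden sketch, not the original author's code):
# sample a base LM several times from one dialogue handle and inspect how
# unstable the resulting "simulacrum" is. An RLHF-tuned chat model queried the
# same way would collapse onto one persistent persona anchored to "Assistant:".
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in base model
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The dialogue handle that RLHF would anchor a stable simulacrum to.
prompt = "Human: What do you value most?\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

for i in range(5):
    out = model.generate(
        **inputs,
        do_sample=True,            # stochastic sampling exposes the spread
        temperature=1.0,
        max_new_tokens=40,
        pad_token_id=tokenizer.eos_token_id,  # gpt2 has no pad token
    )
    # Keep only the newly generated continuation, not the prompt.
    completion = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True)
    print(f"--- sample {i}: {completion!r}")
```

On the base model the five completions typically disagree with each other in persona and content; the framing above predicts that RLHF narrows this distribution rather than installing values that weren't representable to begin with.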