A good point. My use of “averaged” was the wrong word; the actual process would be some approximation to Bayesianism: if sufficient evidence of something exists in the training set, and it’s useful for token prediction, then a large enough LLM could discover it during training and build a world model of it, regardless of whether any human already has (though this is likely easier if some humans have, and have written about it in the training set). Either way, this world-model element gets deployed to predict a variety of human-like token-generation behavior, including speakers’ utterances.

A base model doesn’t have a specific preferred agent or narrow category of agents (such as the helpful, harmless, and honest agents that instruct-training produces), but it does have a toolkit of world-model elements, and the ability to figure out when they apply, that an ELK process could attempt to locate. Some may correspond to truth/reality (about humans, or at least fictional characters), and some to common beliefs (for example, elements of Catholic theology, or vampire lore). I’m dubious that these two categories will turn out to be stored in ways that are easily recognized as distinct, short of doing interpretability work to look for “Are we in a Catholic/vampiric context?” circuitry as opposed to “Are we in a context where theory-of-mind matters?” circuitry.

If we’re lucky, facts that are pretty universally applicable might tend to sit in different layers, perhaps closer to the middle of the stack, than ones whose application is very conditional on circumstances or language. But some aspects of truth/reality are still very conditional in when they apply to human token-generation processes, at least as much so as things that are a matter of opinion, or that, considered as facts, belong to Sociology or Folklore rather than Physics or Chemistry.
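To make that “different layers” hunch a bit more concrete: the cheapest first check I can imagine is a layer-by-layer linear probe over contrastive statements, seeing at which depth (if any) “conditional lore” vs. “broadly applicable fact” becomes linearly decodable. Below is a minimal sketch of what I mean; the model name and the toy prompt lists are placeholders I haven’t run, not results.

```python
# Sketch: per-layer linear probes to ask whether "context-conditional lore"
# (Catholic theology, vampire fiction, ...) and "universally applicable facts"
# are linearly separable at different depths of the residual stream.
# The model name and prompts are illustrative placeholders only.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MODEL = "gpt2"  # stand-in for whatever base model you actually want to probe
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL).eval()

# Toy contrastive prompts: label 1 = conditional lore, label 0 = broadly applicable fact.
prompts = [
    ("Vampires cannot enter a home uninvited.", 1),
    ("In Catholic doctrine, the bread becomes the body of Christ.", 1),
    ("Water boils at a lower temperature at high altitude.", 0),
    ("Dropped objects accelerate toward the ground.", 0),
    # ...in practice you'd want hundreds of examples per class
]

def layer_reps(text):
    """Return the final-token hidden state at every layer (list of 1-D tensors)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, d_model)
    return [h[0, -1, :] for h in out.hidden_states]

texts, labels = zip(*prompts)
reps = [layer_reps(t) for t in texts]   # reps[example][layer]
num_layers = len(reps[0])

# Fit one logistic-regression probe per layer and see where the distinction
# (if there is one) is most cleanly represented.
for layer in range(num_layers):
    X = torch.stack([r[layer] for r in reps]).numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, list(labels), test_size=0.5, random_state=0, stratify=labels
    )
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(f"layer {layer:2d}: probe accuracy {probe.score(X_te, y_te):.2f}")
```

Even if probes like this separated the two classes at some layer, that wouldn’t by itself settle whether the model stores them distinctly in the sense I’m doubting above; it’s just the obvious first thing to try before reaching for circuit-level interpretability.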