It has this deep challenge of distinguishing human-simulators from direct-reporters, and properties like negation-consistency—which could be equally true of each—probably don’t help much with that in the worst case.
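(For concreteness, the negation-consistency property referred to here is the one CCS-style probing trains for: a probe’s credence on a statement and on its negation should sum to roughly one, plus a confidence term to rule out the degenerate always-0.5 answer. A minimal sketch, with the probe width and tensor names purely illustrative:)

```python
import torch

def ccs_loss(p_pos, p_neg):
    """CCS-style objective over a batch of contrast pairs.

    p_pos: probe outputs in (0, 1) for statements phrased as true
    p_neg: probe outputs in (0, 1) for the corresponding negations
    """
    # Negation-consistency: the two credences should sum to ~1.
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    # Confidence: penalise the degenerate p_pos = p_neg = 0.5 solution.
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

# Illustrative usage: h_pos/h_neg stand in for hidden states of the two halves
# of each contrast pair; a real run would extract them from the model.
probe = torch.nn.Sequential(torch.nn.Linear(4096, 1), torch.nn.Sigmoid())
h_pos, h_neg = torch.randn(8, 4096), torch.randn(8, 4096)
loss = ccs_loss(probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1))
```

(The point above is that a human-simulator’s credences can satisfy this property about as well as a direct reporter’s, so the loss alone can’t separate the two in the worst case.)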
Base model LLMs like Chinchilla are trained by SGD as human-token-generation-process simulators. So all you’re going to find are human-simulators. In a base model, there cannot be a “model’s secret latent knowledge” for you to be able to find a direct reporter of. Something like that might arise under RL instruct training intended to encourage the model to normally simulate a consistent category of personas (generally aimed at honest, helpful, and harmless assistant personas), but it would not yet exist in an SGD-trained base model: it’s all human-simulators all the way down. The closest thing to latent knowledge that you could find in a base model is “the ground-truth Bayesian-averaged opinion of all humans about X, before applying context-dependent biases to model the biases of different types of humans found in different places on the internet”. Personally I’d look for that in layers about half-way through the model: I’d expect the sequence to be “figure out the average, then apply context-specific biases”.
Here are some of the questions that make me think that you might get a wide variety of simulations even just with base models:
Do you think that non-human entities can be useful for predicting what humans will say next? For example, imaginary entities from fictions humans share? Or artificial entities they interact with?
Do you think that the concepts a base model learns from predicting human text could recombine in ways that simulate entities that didn’t appear in the training data? A kind of generalization?
But I guess I broadly agree with you that base models are likely to have primarily human-level simulations (there are of course many of these!). But that’s not super reassuring, because we’re not just interested in base models, but mostly in agents that are driven by LLMs one way or another.
When I said “human-simulators” above, that shorthand was intended to include simulations of things like: multiple human authors, editors, and even reviewers cooperating to produce a very high-quality text; the product of a Wikipedia community repeatedly adding to and editing a page; or any fictional character as portrayed by a human author (or multiple human authors, if the character is shared). So not just single human authors writing autobiographically, but basically any token-producing process involving humans whose output can be found on the Internet or other training data sources.
[I] think that you might get a wide variety of simulations even just with base models:
Completely agreed
Do you think that the concepts a base model learns from predicting human text could recombine in ways that simulate entities that didn’t appear in the training data? A kind of generalization?
Absolutely. To the extent that the model learns accurate world models/human psychological models, it may (or may not) be able to accurately extrapolate some distance outside its training distribution. Even where it can’t do so accurately, it may well still try.
I broadly agree with you that base models are likely to have primarily human-level simulations (there are of course many of these!). But that’s not super reassuring, because we’re not just interested in base models, but mostly in agents that are driven by LLMs one way or another.
I agree, most models of concern are the product of applying RL, DPO, or some similar process to a base model. My point was that your choice of a Chinchilla base model (I assume; AFAIK no one has made an instruct-trained version of Chinchilla) seemed odd, given what you’re trying to do.
Conceptually, I’m not sure I would view even an RL instruct-trained model as containing “a single agent” plus “a broad contextual distribution of human-like simulators available to it”. Instead I’d view it as “a broad contextual distribution of human simulators” containing “a narrower contextual distribution of (possibly less human-like) agent simulators encouraged/amplified and modified by the RL process”. I’d be concerned that the latter distribution could, for example, be bimodal, containing both a narrow distribution of helpful, harmless, and honest assistant personas, and a second narrow distribution of their Waluigi-Effect exact-opposite evil twins (such as the agent summoned by the DAN jailbreak). So I question ELK’s underlying assumption that there exists a single “the agent” entity with a single well-defined set of latent knowledge that one could aim to elicit. I think you need to worry about the entire distribution of simulated personas the system is likely to simulate in the first person, including ones that may only be elicited by specific obscure circumstances that could arise at some point, such as jailbreaks or unusual data in the context.
If, as I suspect, the process in the second half of the layers of a model is somewhat sequential, and specifically is one where the model first figures out some form of average or ground truth opinion/facts/knowledge, and then suitably adjusts this for the current context/persona, then it should be possible to elicit the former and then interpret the circuitry for the latter process and determine what it could do to the former.
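(A rough way one might probe for that two-stage structure, assuming access to cached per-layer residual-stream activations for a set of labelled true/false statements; the helper names and data layout below are hypothetical:)

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def layerwise_truth_probe(acts_per_layer, labels):
    """Fit a linear 'is this statement true?' probe at every layer.

    acts_per_layer: list of (n_statements, d_model) arrays, one per layer,
    produced by whatever activation-caching harness is available (not shown).
    labels: 0/1 array marking which statements are actually true.
    Returns cross-validated accuracy per layer; if the two-stage picture is
    right, accuracy should peak around the middle of the stack, before the
    context/persona-specific adjustment overwrites the 'average' signal.
    """
    return [
        cross_val_score(LogisticRegression(max_iter=1000), acts, labels, cv=5).mean()
        for acts in acts_per_layer
    ]

# Hypothetical usage: scores = layerwise_truth_probe(cached_acts, truth_labels)
```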
Yes, that’s very reasonable. My initial goal when first thinking about how to explore CCS was to use CCS with RL-tuned models and to study the development of coherent beliefs as the system is ‘agentised’. We didn’t get that far because we ran into problems with CCS first.
To be frank, the reasons for using Chinchilla are boring and a mixture of technical/organisational. If we did the project again now, we would use Gemini models with ‘equivalent’ finetuned and base models, but given our results so far we didn’t think the effort of setting all that up and analysing it properly was worth the opportunity cost. We did a quick sense-check with an instruction-tuned T5 model that things didn’t completely fall apart, but I agree that the lack of ‘agentised’ models is a significant limitation of our experimental results. I don’t think it changes the conceptual points very much though—I expect to see things like simulated-knowledge in agentised LLMs too.
I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.
By “false claims” do you mean claims in contradiction to “the ground-truth Bayesian-averaged opinion of all humans about X, before applying context-dependent biases to model the biases of different types of humans found in different places on the internet” (something I already said I thought was likely in the model)? Or do you mean claims in contradiction to objective facts where those differ from the above? If the latter, how could the model possibly determine that? It doesn’t have an objective-facts oracle; it just has its training data and the ability to do approximately Bayesian reasoning to construct a world-model from that during the training process.
The fact that a sequence predictor is modeling the data-generating process, i.e. the humans who wrote all the text it trained on and whom it is trying to infer, doesn’t mean it can’t learn a concept corresponding to truth or reality. That concept would, pragmatically, be highly useful for predicting tokens, and so one would have good reason to expect it to exist. As Dick put it, reality is that which doesn’t go away.
An example of how it doesn’t just boil down to ‘averaged opinion of humans about X’ might be theory of mind examples, where I expect that a good LLM would predict ‘future’ tokens correctly despite that being opposed to the opinion of all humans involved. Because reality doesn’t go away, and then the humans change their opinion.
For example, a theory-of-mind vignette like ‘Mom bakes cookies; her children all say they want to eat the cookies that are in the kitchen; unbeknownst to them, Dad has already eaten all the cookies. They go into the kitchen and they: ______’*. The averaged opinion of the children is 100% “they see the cookies there”; and yet, sadly, there are no cookies there, and they will learn better. This is the sort of reasoning where you can learn a simpler and more robust algorithm to predict if you have a concept of ‘reality’ or ‘truth’ which is distinct from mere speaker beliefs or utterances.
You could try to fake it by learning a different ‘reality’ concept tailored to every different possible scenario, every level of storytelling or document indirection, and each time have to learn the levels of fictionality or ‘average opinion of the storyteller’ from scratch (‘the average storyteller of this sort of vignette believes there are no cookies there within the story’), sure… but that would be complicated and difficult to learn and would not predict well. If an LLM can regard every part of reality as just another level in a story, then it can also regard a story as just another level of reality.
And it’s simpler to maintain 1 reality and encode everything else as small deviations from it. (Which is how we write stories or tell lies or consider hypotheticals: even a fantasy story typically breaks only a relatively small number of rules—water will remain wet, fire will still burn under most conditions, etc. If fantasy stories didn’t represent very small deviations from reality, we would be unable to understand them without extensive training on each one separately. Even an extreme exercise like The Gostak, where you have to build a dictionary as you go, or Greg Egan’s mathematical novels, where you need a physics/math degree to understand the new universes despite them only tweaking like 1 equation, still rely very heavily on our pre-existing reality as a base to deviate from.)
* GPT-4: “Find no cookies. They may then question where the cookies went, or possibly seek to make more.”
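(One cheap way to sanity-check this on a base model is to compare the log-probability it assigns to a reality-tracking continuation versus a belief-tracking one for the same vignette. A sketch below, using GPT-2 via Hugging Face transformers purely as a stand-in; the continuation strings are illustrative:)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def continuation_logprob(prompt, continuation):
    """Total log-probability the model assigns to `continuation` after `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Each continuation token is predicted from the position just before it.
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[pos - 1, full_ids[0, pos]].item()
    return total

vignette = ("Mom bakes cookies; her children all say they want to eat the cookies "
            "that are in the kitchen; unbeknownst to them, Dad has already eaten "
            "all the cookies. They go into the kitchen and they")
print(continuation_logprob(vignette, " find no cookies."))        # reality-tracking
print(continuation_logprob(vignette, " see the cookies there."))  # belief-tracking
```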
A good point. My use of “averaged” was the wrong word; the actual process would be some approximation to Bayesianism: if sufficient evidence for something exists in the training set, and it’s useful for token prediction, then a large enough LLM could discover it during the training process and build a world model of it, regardless of whether any human already has (though this is likely easier if some humans have, and have written about it in the training set). Nevertheless, this world-model element gets deployed to predict a variety of human-like token-generation behavior, including speakers’ utterances. A base model doesn’t have a specific preferred agent or narrow category of agents (such as helpful, harmless, and honest agents post instruct-training), but it does have a toolkit of world-model elements and the ability to figure out when they apply, which an ELK-process could attempt to locate. Some may correspond to truth/reality (about humans, or at least fictional characters), and some to common beliefs (for example, elements of Catholic theology, or vampire lore). I’m dubious that these two categories will turn out to be stored in easily recognizably distinct ways, short of doing interpretability work to look for “Are we in a Catholic/vampiric context?” circuitry as opposed to “Are we in a context where theory-of-mind matters?” circuitry. If we’re lucky, facts that are pretty universally applicable might tend to live in different, perhaps closer-to-the-middle-of-the-stack, layers than ones whose application is very conditional on circumstances or language. But some aspects of truth/reality are still very conditional in when they apply to human token-generation processes, at least as much so as things that are a matter of opinion, or that, considered as facts, are part of Sociology or Folklore rather than Physics or Chemistry.