Here are some of the questions that make me think that you might get a wide variety of simulations even just with base models:
Do you think that non-human entities can be useful for predicting what humans will say next? For example, imaginary entities from fiction that humans share? Or artificial entities they interact with?
Do you think that the concepts a base model learns from predicting human text could recombine in ways that simulate entities that didn’t appear in the training data? A kind of generalization?
But I guess I broadly agree with you that base-models are likely to have primarily human-level-simulations (there are of course many of these!). But that’s not super reassuring, because we’re not just interested in base models, but mostly in agents that are driven by LLMs one way or another.
When I said “human-simulators” above, that shorthand was intended to include simulations of things like: multiple human authors, editors, and even reviewers cooperating to produce a very high-quality text; the product of a Wikipedia community repeatedly adding to and editing a page; or any fictional character as portrayed by a human author (or multiple human authors, if the character is shared). So not just single human authors writing autobiographically, but basically any token-producing process involving humans whose output can be found on the Internet or in other training data sources.
[I] think that you might get a wide variety of simulations even just with base models:
Completely agreed
Do you think that the concepts a base model learns from predicting human text could recombine in ways that simulate entities that didn’t appear in the training data? A kind of generalization?
Absolutely. To the extent that the model learns accurate world models/human psychological models, it may (or may not) be able to accurately extrapolate some distance outside its training distribution. Even where it can’t do so accurately, it may well still try.
I broadly agree with you that base-models are likely to have primarily human-level-simulations (there are of course many of these!). But that’s not super reassuring, because we’re not just interested in base models, but mostly in agents that are driven by LLMs one way or another.
I agree, most models of concern are the product of applying RL, DPO, or some similar process to a base model. My point was that your choice of a Chinchilla base model (I assume; AFAIK no one has made an instruct-trained version of Chinchilla) seemed odd, given what you’re trying to do.
Conceptually, I’m not sure I would view even an RL instruct-trained model as containing “a single agent” plus “a broad contextual distribution of human-like simulators available to it”. Instead I’d view it as “a broad contextual distribution of human simulators” containing “a narrower contextual distribution of (possibly less human-like) agent simulators encouraged/amplified and modified by the RL process”. I’d be concerned that the latter distribution could, for example, be bimodal, containing both a narrow distribution of helpful, harmless, and honest assistant personas, and a second narrow distribution of their Waluigi Effect exact-opposite evil twins (such as the agent summoned by the DAN jailbreak). So I question ELK’s underlying assumption that there exists a single “the agent” entity with a single well-defined set of latent knowledge that one could aim to elicit. I think you need to worry about the entire distribution of personas the system is likely to simulate in the first person, including ones that may only be elicited by specific obscure circumstances that could arise at some point, such as jailbreaks or unusual data in the context.
If, as I suspect, the process in the second half of the layers of a model is somewhat sequential, and specifically is one where the model first figures out some form of average or ground truth opinion/facts/knowledge, and then suitably adjusts this for the current context/persona, then it should be possible to elicit the former and then interpret the circuitry for the latter process and determine what it could do to the former.
Yes, that’s very reasonable. My initial goal when first thinking about how to explore CCS was to use CCS with RL-tuned models and to study the development of coherent beliefs as the system is ‘agentised’. We didn’t get that far because we ran into problems with CCS first.
To be frank, the reasons for using Chinchilla are boring and a mixture of technical/organisational. If we did the project again now, we would use Gemini models with ‘equivalent’ finetuned and base models, but given our results so far we didn’t think the effort of setting all that up and analysing it properly was worth the opportunity cost. We did a quick sense-check with an instruction-tuned T5 model to confirm that things didn’t completely fall apart, but I agree that the lack of ‘agentised’ models is a significant limitation of our experimental results. I don’t think it changes the conceptual points very much though; I expect to see things like simulated-knowledge in agentised LLMs too.
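For readers who haven’t seen CCS written down, here is a minimal sketch of the objective discussed above, as I understand it from the original CCS paper: a linear probe assigns a probability to each half of a contrast pair (a statement and its negation), and the loss combines a consistency term with a confidence term. The function and variable names (`probe_probs`, `ccs_loss`, `h`, `w`, `b`) are my own illustrative choices, not anything from the experiments in this thread, and the sketch leaves out the activation normalisation and the optimisation loop.

```python
import numpy as np


def probe_probs(h, w, b):
    """Linear probe with a sigmoid: maps hidden states (n, d) to probabilities (n,).

    `h`, `w`, `b` are hypothetical names for activations, probe weights and bias.
    """
    h = np.asarray(h, dtype=float)
    logits = h @ np.asarray(w, dtype=float) + b
    return 1.0 / (1.0 + np.exp(-logits))


def ccs_loss(p_pos, p_neg):
    """Unsupervised CCS-style loss on a batch of contrast pairs.

    p_pos: probe probabilities for the "statement is true" phrasing.
    p_neg: probe probabilities for the "statement is false" phrasing.
    """
    p_pos = np.asarray(p_pos, dtype=float)
    p_neg = np.asarray(p_neg, dtype=float)
    # Consistency: p(x+) should equal 1 - p(x-), as for a statement and its negation.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalise the degenerate solution p(x+) = p(x-) = 0.5.
    confidence = np.minimum(p_pos, p_neg) ** 2
    return float(np.mean(consistency + confidence))
```

A probe that outputs 0.5 for everything is perfectly consistent but still incurs a loss of 0.25 from the confidence term, whereas a probe that outputs 1 for one phrasing and 0 for the other incurs zero loss. Note that nothing in this objective refers to truth labels, which is why any consistent, confident feature (including a simulated persona’s beliefs) can satisfy it.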