Model Psychology is the study of the behavior of the agentic entities[3] that LLMs simulate.
I’m not a huge fan of this definition:
Studying models from a black box perspective (what I’d typically call model psych) can be more general than LLMs
This definition assumes a particular frame for thinking about LLMs which seems somewhat misleading to me.
We don’t just care about the character/personas that models predict, we also might care about questions like “when does the LLM make mistakes and what does that imply about the algorithm it’s using to predict various things” (this can hold even in cases where character/persona are unimportant)
Maybe it would have been better to call it LLM psychology? Indeed, I used this formulation because it seemed to be used quite a lot in the field. After consideration, I think it makes sense to change the name to Large Language Model Psychology instead of Model Psychology, as the latter is too vague.
In later posts I’ll showcase why this framing makes sense; it is quite hard to argue for it without them right now. I’ll come back to this comment later.
I think the current definition does not exclude this. I am talking about the study of agentic entities and their behaviors, and making a mistake is included in that. Something interesting would be to understand whether all the simulacra make the same mistake, or whether only some specific simulacra make it, and what in the context is influencing that.
It seems weird to think of it as “the simulacrum making a mistake” in many cases where the model makes a prediction error.
Like suppose I prompt the model with:
[user@computer ~]$ python
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> x = random.random()
>>> y = random.random()
>>> x
0.9818489460280343
>>> y
0.7500874791464012
>>> x + y
And suppose the model gets the wrong answer. Is this the Python simulacrum making a mistake?
(edit: this would presumably work better with a base model, but even non-base models can be prompted to act much more like base models in many cases.)
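To make this probe concrete, here is a minimal sketch of how one could run it against a completion endpoint and score the continuation. The query_model helper below is hypothetical (a stand-in for whatever client you actually use); only the prompt and the arithmetic come from the example above.

# Minimal sketch of the prediction-error probe described above.
# query_model is a hypothetical stand-in for a real completion API call.

REPL_PROMPT = '''[user@computer ~]$ python
Python 3.11.5 (main, Sep 2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> x = random.random()
>>> y = random.random()
>>> x
0.9818489460280343
>>> y
0.7500874791464012
>>> x + y
'''

def query_model(prompt: str) -> str:
    """Hypothetical completion call; replace with a real API client."""
    raise NotImplementedError

def check_sum_prediction() -> bool:
    """Return True if the model's continuation matches the true sum closely."""
    true_sum = 0.9818489460280343 + 0.7500874791464012
    completion = query_model(REPL_PROMPT).strip()
    first_line = completion.splitlines()[0] if completion else ""
    try:
        predicted = float(first_line)
    except ValueError:
        return False
    # Treat the continuation as correct if it matches to ~10 decimal places.
    return abs(predicted - true_sum) < 1e-10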
After your edit, I think I am seeing the confusion now. I agree that studying Oracle and Tool predictions is interesting, but it is out of the scope of LLM Psychology. I chose to narrow down my approach to studying the behaviors of agentic entities, as I think that is where the most interesting questions arise. Maybe I should clarify this in the post.
Note that I’ve chosen to narrow down my approach to LLM psychology to the agentic entities, mainly because the scary or interesting things to study with a psychological approach are either the behaviors of those entities, or the capabilities they are able to use.
I added this to the Definition. Does it resolve your concerns for this point?
The thing is that, when you ask this to ChatGPT, it is still the simulacrum ChatGPT that is going to answer, not an oracle prediction (like you can see in base models). If you want to know the capability of the underlying simulator with chat models, you need to sample enough simulacra to be sure that the mistakes come from the simulator’s lack of capability and not from the simulacra’s preferences (or modes, as Janus calls them). For math, it is often not important to check different simulacra, because each simulacrum tends to share the same math ability (unless you use some weird languages; @Ethan Edwards might be able to jump in here). But for other capabilities (like biology or cooking), changing the simulacrum you interact with does have a big impact on the performance of the model. You can see in GPT-4’s technical report that language impacts performance a lot, and using another language is one of the ways to modulate the simulacrum you are interacting with. I’ll showcase in the next batch of posts how you can take control a bit more precisely. Tell me if you need more details.
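As an illustration of what “sampling enough simulacra” could look like in practice, here is a rough sketch. The persona framings, the toy question, and the query_model helper are all invented for the example; the idea is just to ask the same question under several simulacra and see whether errors track the persona or persist across all of them (which would point at the simulator).

# Rough sketch: ask the same question under several persona framings and
# compare error rates per persona. query_model is a hypothetical completion call.
from collections import defaultdict

PERSONAS = [  # hypothetical framings used to steer toward different simulacra
    "You are a meticulous research chemist.",
    "You are a laid-back home cook sharing tips.",
    "Tu es un professeur de biologie.",  # a language shift is another lever
]

QUESTION = "At roughly what temperature does water boil at sea level, in Celsius?"

def query_model(prompt: str) -> str:
    """Hypothetical completion call; replace with a real API client."""
    raise NotImplementedError

def error_rates(n_samples: int = 20) -> dict:
    """Fraction of wrong answers per persona for the same underlying question."""
    wrong = defaultdict(int)
    for persona in PERSONAS:
        for _ in range(n_samples):
            answer = query_model(f"{persona}\n\n{QUESTION}")
            if "100" not in answer:  # crude correctness check for this toy question
                wrong[persona] += 1
    return {persona: wrong[persona] / n_samples for persona in PERSONAS}

# If every persona shows a similar error rate, the failure likely reflects the
# simulator's capability; if only some personas fail, it reflects their preferences.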
What I’d like to see in the coming posts:
A stated definition of the goals of the model psych research you’re doing
An evaluation of the hypotheses you’re presenting in terms of those goals
Controlled measurements (not only examples)
For example, from the previous post on stochastic parrots, I infer that one of your goals is predicting what capabilities models have. If that is the case, then the evaluation should be “given a specific range of capabilities, predict which of them models will and won’t have”, and the list should be established before any measurement is made, and maybe even before a model is trained/released (since this is where these predictions would be the most useful for AI safety; I’d love to know whether the deadly model is GPT-4 or GPT-9).
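To make the pre-registration idea concrete, here is a sketch of what it could look like; the capability list and the scoring rule below are invented for illustration, and the point is only that the predictions are written down and frozen before any measurement happens.

# Illustrative sketch of a pre-registered capability prediction list.
# The capabilities and predicted outcomes are made up for the example.

PREDICTIONS = {                      # frozen before any measurement is made
    "3-digit multiplication": True,  # predicted: the model will have it
    "base64 decoding": True,
    "novel protein design": False,   # predicted: the model won't have it
}

def score_predictions(measured: dict) -> float:
    """Fraction of pre-registered predictions matching measured outcomes (True/False)."""
    scored = [c for c in PREDICTIONS if c in measured]
    hits = sum(PREDICTIONS[c] == measured[c] for c in scored)
    return hits / len(scored) if scored else 0.0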
I don’t know much about human psych, but it seems to me that it is most useful when it describes some behavior quantitatively with controlled predictions (à la CBT), and not when it does qualitative analysis based on personal experience with the subject (à la Freud).