I suspect future language models will have beliefs in a more meaningful sense than current language models, but I don’t know in what sense exactly, and I don’t think this is necessarily essential for our purposes.
In Active Inference terms, the activations within current LLMs upon processing the context parameterise the LLM’s predictions of observations (future text), Q(y|x), where x denotes internal world-model states and y denotes expected observations (future tokens). So current LLMs do have beliefs.
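To make this concrete, here is a minimal sketch in Python (using the Hugging Face transformers library; GPT-2 and the toy prompt are just illustrative stand-ins for “current LLMs”) of reading off Q(y|x) from the model’s activations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# x: internal states computed from the context (final hidden state at the last position)
x = outputs.hidden_states[-1][0, -1]

# Q(y|x): the distribution over expected observations (next tokens) that these
# internal states parameterise via the unembedding matrix and softmax
q_y_given_x = torch.softmax(outputs.logits[0, -1], dim=-1)

top = torch.topk(q_y_given_x, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")
```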
Worry 2: Even if GPT-n develops “beliefs” in a meaningful sense, it isn’t obvious that GPT-n will actively “think about” whether a given natural language input is true. In particular, “the truth of this natural language input” may not be a useful enough feature for GPT-n to consistently compute and represent in its activations. Another way of framing this worry is that perhaps the model has superhuman beliefs, but doesn’t explicitly “connect these to language” – similar to how MuZero’s superhuman concepts aren’t connected to language.
This is already true for current LLMs. Outside of symbolic computing, language is a tool for communicating something about the world states in our generative models; it is not those states themselves. In other words, language is a tool for communicating, exchanging, and adjusting beliefs, not something that we (humans, and DNNs without symbolic computation modules) have beliefs about.
Thus, methods that involve using language in one way or another, including CCS, should methodologically be part of the process of building effective and robust theory-of-mind explanations of GPT-N, which include the generative world model inside that mind. In parallel, we should also think about creating model architectures and training techniques that allow us to align their world models with ours. This is what Friston’s vision is about.
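For reference, a rough sketch of what CCS itself does, as I understand it from Burns et al.’s “Discovering Latent Knowledge” paper: an unsupervised linear probe is fit on the activations of contrast pairs (a statement and its negation) so that the probabilities it assigns behave like p and 1 − p. The probe class, tensor shapes, and training loop below are my illustrative assumptions, not the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    # A linear probe mapping hidden activations to a probability of "true"
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    consistency = (p_pos - (1.0 - p_neg)) ** 2     # probabilities of a statement and its negation should sum to ~1
    confidence = torch.minimum(p_pos, p_neg) ** 2  # discourage the degenerate "both ~0.5" solution
    return (consistency + confidence).mean()

# Illustrative usage with random tensors standing in for real hidden states
hidden_dim, n_pairs = 768, 128
h_pos = torch.randn(n_pairs, hidden_dim)  # activations for "X is true" statements
h_neg = torch.randn(n_pairs, hidden_dim)  # activations for "X is false" statements

probe = CCSProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    optimizer.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    optimizer.step()
```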
In order for this “model alignment” to work robustly with the help of language, we must also “meta-align” with LLMs on some cognitive/intelligence disciplines that are usually left implicit: semantics, philosophy of language, and communication theories (in an anthropological or social sense, à la the theory of communicative acts, rather than Shannon’s theory of information). I call this type of alignment “intelligence alignment” in a post that I’ll publish soon.
It doesn’t help, unfortunately, that humans themselves are currently confused and not aligned with each other regarding theories of semantics and the philosophy of language. Hopefully, we can converge on something and impart these theories to superhuman models. The recent progress in NLP probably helps.
Regarding “truth” representations and disambiguation of different “truth-like” beliefs
If we really assume a superhuman model, I think there are good reasons to make it situationally aware (the argument is beyond the scope of this comment; in brief, I see this as the only effective way to prevent inner misalignment). Such a model will probably understand that the prompts you propose are just ploys to elicit its beliefs, and it will not represent those beliefs if it wants to conceal them.
So, without ensuring that the model is incentivised (in the game-theoretic sense) to communicate and to align its states (beliefs) with certain counterparties, as discussed above, the presented strategies are at best brittle within the larger alignment scheme. But if we ensure that all the game-theoretic and multi-polar incentives are in place (I wrote a little about this here; this will be difficult to do, but it is not the point of our discussion), then the particular worries you discuss, finding the “truth” representation and disambiguating different “truths”, dissolve. It becomes just a matter of asking the model “do you believe X is true?”.
Note, again, how the philosophy of language, semantics, and inner alignment are the bigger problems here. If your world model and the model’s world model are not inner-aligned (i.e., not equally grounded in reality), some linguistic statements can be misinterpreted, which in turn makes these methods for eliciting beliefs unreliable. Consider, for example, a question like “do you believe that killing anyone could be good?”, where humans and the model are inner-misaligned on what “good” means. No matter how reliable your elicitation technique is, what you elicit is useless garbage if you don’t already share a lot of beliefs.
This seems to imply that the alignment process is unreliable unless humans and the model are already (almost) aligned; consequently, alignment should start relatively early in the training of superhuman models, not after the model is already trained. Enter: model “development” and “upbringing”, versus indiscriminate “self-supervised training” on text from the internet in random order.