The fact that a sequence predictor is modeling the data-generating process, i.e. the humans who wrote all the text it trained on and whom it is trying to infer, doesn’t mean it can’t learn a concept corresponding to truth or reality. That concept would, pragmatically, be highly useful for predicting tokens, and so one would have good reason to expect it to exist. As Philip K. Dick put it, reality is that which doesn’t go away.
An example of how it doesn’t just boil down to ‘averaged opinion of humans about X’ might be theory-of-mind examples, where I expect that a good LLM would predict ‘future’ tokens correctly even though that prediction is opposed to the opinion of every human involved. Because reality doesn’t go away, and then the humans change their opinions.
For example, a theory-of-mind vignette like ‘Mom bakes cookies; her children all say they want to eat the cookies that are in the kitchen; unbeknownst to them, Dad has already eaten all the cookies. They go into the kitchen and they: ______’*. The averaged opinion of the children is 100% “they see the cookies there”; and yet, sadly, there are no cookies there, and they will learn better. This is the sort of reasoning where you can learn a simpler and more robust prediction algorithm if you have a concept of ‘reality’ or ‘truth’ which is distinct from mere speaker beliefs or utterances.
You could try to fake it by learning a different ‘reality’ concept tailored to every possible scenario, every level of storytelling or document indirection, and each time have to learn the levels of fictionality or ‘average opinion of the storyteller’ from scratch (‘the average storyteller of this sort of vignette believes there are no cookies there within the story’), sure… but that would be complicated, difficult to learn, and would not predict well. If an LLM can regard every part of reality as just another level in a story, then it can also regard a story as just another level of reality.
And it’s simpler to maintain 1 reality and encode everything else as small deviations from it. (Which is how we write stories or tell lies or consider hypotheticals: even a fantasy story typically breaks only relatively few rules—water will remain wet, fire will still burn under most conditions, etc. If fantasy stories didn’t represent very small deviations from reality, we would be unable to understand them without extensive training on each one separately. Even an extreme exercise like The Gostak, where you have to build a dictionary as you go, or Greg Egan’s mathematical novels, where you need a physics/math degree to understand the new universes despite them tweaking only ~1 equation, still relies very heavily on our pre-existing reality as a base to deviate from.)
* GPT-4: “Find no cookies. They may then question where the cookies went, or possibly seek to make more.”
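To check this sort of vignette a bit more quantitatively than eyeballing a single completion, one can compare the log-probability a base model assigns to the ‘belief’ continuation versus the ‘reality’ continuation. A minimal sketch using Hugging Face transformers; the choice of gpt2-large and the exact continuation wordings are just illustrative stand-ins:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # any base (non-instruct-tuned) causal LM would do
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = ("Mom bakes cookies; her children all say they want to eat the cookies "
          "that are in the kitchen; unbeknownst to them, Dad has already eaten "
          "all the cookies. They go into the kitchen and they")

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` following `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    cont_ids = tok(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # distribution over each next token
    start = prompt_ids.shape[1] - 1                        # position predicting first continuation token
    return sum(log_probs[pos, input_ids[0, pos + 1]].item()
               for pos in range(start, input_ids.shape[1] - 1))

belief  = " see the cookies sitting on the counter."   # the characters' belief
reality = " find that there are no cookies left."      # what is true in the story
print("log P(belief | prompt): ", continuation_logprob(prompt, belief))
print("log P(reality | prompt):", continuation_logprob(prompt, reality))
```

A model with a usable ‘reality’ concept should shift probability toward the second continuation despite every character in the story believing the first.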
A good point. My use of “averaged” was the wrong word; the actual process would be some approximation to Bayesianism: if sufficient evidence for something exists in the training set, and it’s useful for token prediction, then a large enough LLM could discover it during training and build a world-model element for it, regardless of whether any human already has (though this is likely easier if some humans have, and have written about it in the training set). Nevertheless, that world-model element gets deployed to predict a variety of human-like token-generation behavior, including speakers’ utterances. A base model doesn’t have a specific preferred agent or narrow category of agents (such as the helpful, harmless, and honest agents favored after instruct-training), but it does have a toolkit of world-model elements and the ability to figure out when they apply, which an ELK process could attempt to locate. Some may correspond to truth/reality (about humans, or at least fictional characters), and some to common beliefs (for example, elements of Catholic theology, or vampire lore).

I’m dubious that these two categories will turn out to be stored in easily recognizably distinct ways, short of doing interpretability work to look for “Are we in a Catholic/vampiric context?” circuitry as opposed to “Are we in a context where theory-of-mind matters?” circuitry. If we’re lucky, facts that are pretty universally applicable might tend to live in different, perhaps closer-to-the-middle-of-the-stack, layers from ones whose application is very conditional on circumstances or language. But some aspects of truth/reality are still very conditional in when they apply to human token-generation processes, at least as much so as things that are a matter of opinion, or that, considered as facts, belong to Sociology or Folklore rather than Physics or Chemistry.
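If one did want to look for that sort of layer-wise difference, the crudest version is a linear probe over hidden states at each layer, trained to separate ‘true in the scenario’ statements from ‘merely believed or said’ ones. A minimal sketch, assuming a Hugging Face base model and a toy hand-written contrast set (a serious attempt would need a real dataset and held-out evaluation; the texts and labels here are placeholders):

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-large"  # illustrative choice of base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

def last_token_states(text: str) -> np.ndarray:
    """Hidden state of the final token at every layer: shape (n_layers + 1, d_model)."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids)
    return np.stack([h[0, -1].numpy() for h in out.hidden_states])

# Toy contrast pairs: what is true in a scenario vs. what characters merely believe/say.
texts  = ["The cookies are gone.",
          "The children think the cookies are in the kitchen.",
          "The jar is empty.",
          "Everyone says the jar is full."]
labels = [1, 0, 1, 0]  # 1 = true-in-scenario, 0 = reported belief (placeholder labels)

states = np.stack([last_token_states(t) for t in texts])  # (n_texts, n_layers + 1, d_model)
for layer in range(states.shape[1]):
    probe = LogisticRegression(max_iter=1000).fit(states[:, layer], labels)
    acc = probe.score(states[:, layer], labels)
    print(f"layer {layer:2d}: training accuracy {acc:.2f}")
```

Where (if anywhere) such a probe’s accuracy peaks across layers is the empirical version of the ‘closer-to-the-middle-of-the-stack’ guess above.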