Other commenters have argued about the correctness of using Shoggoth. I think it’s mostly a correct term if you take it in the Lovecraftian sense, given that we currently don’t understand LLMs all that well. Interpretability might work and we might make progress, so we’re not sure they actually are incomprehensible like Shoggoths (though according to Wikipedia, Shoggoths are made of ordinary physical stuff, so an advanced enough civilization could presumably come to understand them; the analogy holds surprisingly well!). Anyhow, it’s a good meme, and it’s useful for saying “hey, we don’t understand these things as well as you might imagine from interacting with the smiley face” to describe our current state of knowledge.
Now for trying to construct some idea of what it is. I’ll argue a bit against calling an LLM a pile of masks, as that framing seems to carry implications I don’t believe in. The question we’re asking ourselves is something like “what kinds of algorithms/patterns do we expect to appear when an LLM is trained? Do those look like a pile of masks, or like some more general simulator that creates masks on the fly?”, and the answer depends on the specifics and on optimization pressure. I wanna sketch out different stages we could hope to see and understand better (and I’d like for us to test this empirically and find out how true it is). Earlier stages don’t disappear, as they stay useful at all times, though other things start weighing more heavily in the next-token predictions.
Level 0: Incompetent. Random weights, no information about the real world or about text space.
Level 1 “Statistics 101”: Dumb heuristics that don’t take word positions into account. It knows facts about text, like the overall token distribution, and uses those.
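To make Level 1 concrete, here’s a toy sketch of mine (nothing from the original post, and obviously nothing like real LLM internals): a position-blind unigram model that only knows overall token frequencies, so every position gets the exact same prediction.

```python
import random
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Level 1: count how often each token appears, ignoring position entirely.
unigram_counts = Counter(corpus)
tokens = list(unigram_counts)
weights = [unigram_counts[t] for t in tokens]

def predict_next(_context):
    # The context is accepted but completely ignored.
    return random.choices(tokens, weights=weights)[0]

print(predict_next(["the", "dog"]))  # "the" is the most likely sample, context unused
```

The point of the sketch is just that a surprising amount of “knowledge” (which words exist, which are common) already lives in a model this dumb.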
Level 2 “Statistics 201”: Better heuristics, something equivalent to Markov chains. Its knowledge of text space increases: it produces idioms and reproduces common patterns. At this stage it already contains a huge amount of information about the world. It “knows” stuff like mirrors being likely to break and cause seven years of bad luck.
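A Level 2 sketch, again just a toy illustration of mine: a first-order Markov chain that conditions on the previous token. Even this tiny step up from unigrams starts capturing idioms and common patterns.

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Level 2: bigram counts, i.e. an estimate of P(next token | previous token).
bigrams = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(prev):
    options = bigrams[prev]
    return random.choices(list(options), weights=list(options.values()))[0]

print(predict_next("sat"))  # always "on" in this tiny corpus
```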
Level 3 “+ Simple algorithms”: Some pretty specific algorithms appear (like Indirect Object Identification), which can search for certain information and move it around in more sophisticated ways. Some of these algorithms are good enough that they might not properly be described as heuristics anymore, but as really representing the actual rules, as strongly as those exist in language (like rules of grammar properly applied). Note that these circuits appear multiple times and trade off against other things, so overall behavior is still stochastic; there are heuristics for how much weight to give these algorithms versus other information.
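For intuition, here’s a toy sketch (mine; real transformer circuits implement this with attention heads, not explicit code) of what the Indirect Object Identification behavior computes: in a sentence like “When Mary and John went to the store, John gave a drink to ___”, find the name mentioned twice and output the other one.

```python
from collections import Counter

def indirect_object(prompt_tokens, names):
    # IOI behavior: the subject is the name that appears twice;
    # the indirect object is the name that appears only once.
    counts = Counter(t for t in prompt_tokens if t in names)
    return min(counts, key=counts.get)

toks = "When Mary and John went to the store , John gave a drink to".split()
print(indirect_object(toks, {"Mary", "John"}))  # Mary
```

This is an actual rule rather than a frequency heuristic, which is the qualitative jump Level 3 is pointing at.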
Level 4 “Simulating what created that text”: This is where it starts to have more and more powerful in-context learning, i.e. its weights represent algorithms which do in-context search (combined with its vast encyclopedic knowledge of texts, tropes, and genres) and figure out consistencies in characters or concepts introduced in the prompt. For example, it’ll pick up on Alice’s and Bob’s different backgrounds, latent knowledge about them, their accents. But it only does that because that’s what authors generally do, and it makes the same reasoning errors that are common in tropes. That’s because it simulates not the content of the text (the characters in the story), but the thing which generates the story (the writer, who themselves have some simulation of the characters).
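A crude sketch of the in-context-search ingredient of Level 4 (my toy analogy, loosely inspired by induction heads, nowhere near the real mechanism): when the recent context matches something seen earlier in the prompt, predict whatever followed it last time. That single rule is enough to keep a character introduced in the prompt consistent.

```python
def induction_predict(tokens):
    """If the last two tokens appeared together earlier in the context,
    predict the token that followed them then."""
    pair = (tokens[-2], tokens[-1])
    for i in range(len(tokens) - 3, -1, -1):
        if (tokens[i], tokens[i + 1]) == pair:
            return tokens[i + 2]
    return None  # no in-context match: fall back to lower-level statistics

ctx = "Alice : bonjour ! Bob : howdy ! Alice :".split()
print(induction_predict(ctx))  # bonjour — Alice keeps her accent
```

Note the prediction is driven purely by the prompt, not by anything stored in the “weights” of this toy; that’s what makes it in-context learning rather than memorization.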
So uh, do masks or piles of masks fit anywhere in this story? Not that I can see. The mask is a specific metaphor for the RLHF fine-tuning which causes mode collapse and makes the LLM mostly play only the nice assistant (and its opposites). It’s a constraint or a bridle or something, but if that training is light (doesn’t affect the weights too much), then we expect the LLM to mostly stay what it was before, and that wasn’t masks.
Nor are there piles of masks. It’s a bunch of weights that are really good at token prediction, learning more and more sophisticated strategies for it. It encodes stereotypes in different places (maybe French=seduction or SF=techbro), but I don’t believe these map onto distinct characters. Instead, I expect that at level 4 there’s a more general algorithm which pieces the different knowledge together, and which in-context learns to simulate particular agents. Thus, if you just take mask to mean “character”, the LLM isn’t a pile of them, but a machine which can produce them on demand.
(In this view of LLMs, x-risk happens when we feed in some input on which the LLM simulates an agentic, deceptive, self-aware agent which steers the outputs until it escapes the box.)