I don’t see why projecting logits from the residual stream should require anything like search. In fact, the logit lens seems like strong evidence against this being the case, since it shows that intermediate hidden representations are just one linear transformation away from making predictions about the vocab distribution.
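To be concrete about what the lens involves: it just reuses the model’s own final layer norm and unembedding as a readout on intermediate residual-stream states. A minimal sketch, assuming GPT-2 via Hugging Face transformers (the model, prompt, and the choice to print every layer are arbitrary illustrative choices):

```python
# Minimal logit-lens sketch: read vocab logits off intermediate residual-stream
# states using the model's own final layer norm (ln_f) and unembedding (lm_head).
# GPT-2 and the prompt are arbitrary illustrative choices.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

final_top = out.logits[0, -1].argmax()  # the token the full model actually predicts

# hidden_states[0] is the embedding output, then one entry per block; the very
# last entry already has ln_f applied, so we stop before it.
for k, hs in enumerate(out.hidden_states[:-1]):
    h = model.transformer.ln_f(hs[0, -1])     # final layer norm on layer k's state
    logits_k = model.lm_head(h)               # one linear map to vocab logits
    p = F.softmax(logits_k, dim=-1)[final_top].item()
    guess = tok.decode(logits_k.argmax().item())
    print(f"layer {k:2d}: top guess = {guess!r}, p(final answer) = {p:.3f}")
```

The readout used at every depth is the same layer norm plus a single matrix multiply, so there is nowhere in that step for search-like machinery to hide.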
It’s not like SGD is sampling random programs, conditioning only on those programs achieving low loss.
Yeah, I’ll pile on in agreement.
I feel like thinking of the internals of transformers as doing general search—especially search over things to simulate—is some kind of fallacy. The system as a whole (the transformer) outputs a simulation of the training distribution, but that doesn’t mean it’s made of parts that themselves do simulations, or that refer to “simulating a thing” as a basic part of some internal ontology.
I think “classic” inner alignment failure (where some inner Azazel has preferences about the real world) is a procrustean bed—it fits an RL agent navigating the real world, but not so much a pure language model.
I mean, that just pushes the problem back by one step. If we take LLMs to be simulators, they’d necessarily need to have some function that maps the simulation-state to the probability over the output tokens (since, after all, the ground truth of reality they’re simulating isn’t probability distributions over tokens).
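To make that concrete, here’s a deliberately silly toy of the kind of function I mean; everything in it is hypothetical scaffolding for illustration, not a claim about how a real model implements it:

```python
# Toy of the "simulation-state => output distribution" function: the simulated
# world isn't made of token probabilities, so something must render it into them.
# All names and numbers here are made up purely for illustration.
import random

def sim_step(state: dict) -> dict:
    """Advance a tiny 'weather world' by one step."""
    p_rain = 0.8 if state["cloudy"] else 0.1
    return {"cloudy": state["cloudy"], "raining": random.random() < p_rain}

def state_to_token_dist(state: dict) -> dict:
    """Map the simulated world-state to a distribution over next tokens."""
    if state["raining"]:
        return {"rain": 0.70, "sun": 0.05, "clouds": 0.25}
    return {"rain": 0.10, "sun": 0.55, "clouds": 0.35}

state = sim_step({"cloudy": True, "raining": False})
print(state_to_token_dist(state))
```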
And if LLMs work by gradually refining the probability distribution over the output that they keep in the residual stream, that would just imply that the “simulation-state ⇒ output distribution” functions are upstream of the residual stream — i.e., every intermediate layer both runs a simulation-step and figures out how to map that simulation’s new state into a distribution-over-outputs.
Of course, it seems architecturally impossible for modern LLMs to run a general-purpose search at that step, but in my view that’s an argument against modern LLM architectures being AGI-complete, not an argument that search is unnecessary.
If we take LLMs to be simulators, they’d necessarily need to have some function that maps the simulation-state to the probability over the output tokens
I disagree with this picture. “Simulators” just describes the external behavior of the model, and doesn’t imply LLMs internally function anything like the programs humans write when we want to simulate something, or like our intuitive notions of what a simulator ought to do.
I think it’s better to start with what we’ve found of deep network internal structures, which seem to be exponentially large ensembles of fairly shallow paths, and then think about what sort of computational structures would be consistent with that information while also 1) achieving low loss, and 2) being plausibly findable by SGD from a random init.
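By “ensembles of fairly shallow paths” I mean roughly this picture: a residual stack of L blocks expands into 2^L paths, each of which applies only a subset of the blocks, and most of those paths are much shorter than L. A toy version, using linear blocks as a simplifying assumption so that the expansion is exact:

```python
# Toy "ensemble of shallow paths" picture: a residual stack (I + f_L)...(I + f_1)
# expands into 2**L paths, each applying some subset of the blocks. Linear blocks
# are a simplification so that the expansion is exact.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, L = 4, 3
blocks = [0.1 * rng.normal(size=(d, d)) for _ in range(L)]  # linear stand-ins for the blocks
x = rng.normal(size=d)

# Composed residual network: apply (I + W) for each block in turn.
composed = x.copy()
for W in blocks:
    composed = composed + W @ composed

# Same computation expanded into 2**L paths: each path skips or applies each block.
expanded = np.zeros(d)
for choice in itertools.product([False, True], repeat=L):
    term = x.copy()
    for use_block, W in zip(choice, blocks):
        if use_block:
            term = W @ term
    expanded += term

print(np.allclose(composed, expanded))  # True: the network equals the sum of its paths
print([sum(c) for c in itertools.product([0, 1], repeat=L)])  # blocks used per path; most paths are shallow
```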
My tentative guess is that LLMs internally look like a fuzzy key-value lookup table over a vast quantity of (mostly shallow) patterns about text content. They do some sort of similarity matching between the input texts and the features that different stored patterns “expect” in any text to which the pattern applies. Any patterns which trigger then quickly add their predictions into the residual stream, similar to what’s described here.
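As a toy sketch of the kind of structure I have in mind (all shapes, names, and weights are purely illustrative; it’s essentially the key-value-memory reading of an MLP block, not a claim about any particular trained model):

```python
# Toy sketch of "fuzzy key-value lookup" patterns writing into the residual stream.
# All shapes, names, and weights are made up; this is the key-value-memory reading
# of an MLP block, not a claim about any specific trained model.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_patterns = 16, 1000

keys = rng.normal(size=(n_patterns, d_model))    # the features each pattern "expects"
values = rng.normal(size=(n_patterns, d_model))  # the prediction each pattern votes for

def pattern_layer(resid: np.ndarray) -> np.ndarray:
    """Match stored patterns against the residual stream and add their votes to it."""
    match = keys @ resid                  # similarity between the input and each key
    active = np.maximum(match, 0.0)       # only sufficiently matching patterns fire
    votes = active @ values               # weighted sum of the firing patterns' values
    return resid + votes                  # written additively into the residual stream

resid = rng.normal(size=d_model)          # stand-in for the residual stream at one position
resid = pattern_layer(resid)
# The output distribution would then be one shared linear read-off of `resid`,
# as in the logit-lens sketch above.
```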
In such a structure, having any significant translation step between the internal states of the predictive patterns and the output logits would be a huge issue: you’d have to replicate that translation many times across the network, not just once per layer but many times per layer, since a single layer implements many ~independent paths simultaneously.
I do agree that LLM architectures seem poorly suited to learning the sorts of algorithms I think people imagine when they say stuff like “general purpose search”. However, I take that as an update against those sorts of algorithms being important for powerful cognition, given that transformers have been the SOTA architecture for over 5 years while remaining essentially unchanged, despite many, many people trying to improve on them.
Fair enough; I don’t disagree that this is likely how current LLMs work.
I maintain, however, that it makes me very skeptical that their architecture is AGI-complete. In particular, I expect it’s incapable of supporting the sort of high-fidelity simulations that people often talk about in the context of e.g. accelerating alignment research. And I expect that, conversely, architectures powerful enough to do that would be different enough to support search, and would therefore carry the dangers of inner misalignment.
I can sort of see the alternate picture, though, where the shallow patterns they implement include some sort of general-enough planning heuristics that’d theoretically let them make genuinely novel inferences over enough steps. I think that’d run into severe inefficiencies… but my intuition on that is a bit difficult to unpack.
Hm. Do you think the current LLM architectures are AGI-complete, if you scale them enough? If yes, how do you imagine they’d be carrying out novel inferences, mechanically? Inferences that require making use of novel abstractions?