Rather than just learning surface-level statistics about the distribution of moves, it learned to model the underlying process that generated that data. In my opinion, it’s already pretty obvious that transformers can do something more than statistical correlation and pattern matching (see e.g. induction heads), but it’s great to have clearer evidence of fully-fledged world models!
This updated me slightly upwards on “LLMs trained on text learn to model the underlying world, without needing multimodal inputs to pin down more of the world’s properties (e.g. spatial structure).” I had previously considered that any given corpus could have been generated by a large number of possible worlds, but I now don’t weight this objection as highly.
Interesting, I hadn’t seen that objection before! Can you say more? (Though maybe not if you aren’t as convinced by it any more.) To me, the natural response is that there are many possible worlds, but they all share some commonalities, and it’s those commonalities that get modelled. Or possibly that the model separately simulates the different worlds.
So, first, there’s a subtlety: the model isn’t “remembering” having “seen” all of the text. It was updated by gradients taken over its outputs on the historical corpus. As a result, “which worlds are consistent with the observations” is a wrongly-shaped claim. (I don’t think you fell prey to that mistake in OP, to be clear.)
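To make the gradient point concrete, here’s a minimal sketch (generic PyTorch, not the OP’s actual code; `model` and `optimizer` are assumed to be defined elsewhere) of a next-token training step. The corpus only enters the weights through the gradient of the prediction loss; nothing about the observations is stored as such.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, token_batch):
    """token_batch: LongTensor of shape (batch, seq_len) of corpus tokens."""
    inputs, targets = token_batch[:, :-1], token_batch[:, 1:]
    logits = model(inputs)                    # (batch, seq_len - 1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # predictions at each position
        targets.reshape(-1),                  # the tokens that actually came next
    )
    optimizer.zero_grad()
    loss.backward()   # the corpus affects the weights only through this gradient
    optimizer.step()
    return loss.item()
```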
Second, on my loose understanding of metaphysics (i.e. this is reasoning which could very easily be misguided), there exist computable universes which contain entities training this language model on this corpus / set of historical signals, such that the entire setup is specified by the initial conditions and the laws of physics. In that case, the corpus and its regularities (“dogs” and “syntax” and such) wouldn’t necessarily reflect the world the agent is embedded in, which could be anything, really. Like maybe there’s an alien species on a gas giant somewhere which is training on fictional sequences of tokens, some of which happen to look like “dog”.
Of course, by point (1), what matters isn’t the corpus itself (i.e. which sentences appear) but how that corpus imprints itself into the network via the gradients. And your post seems like evidence that even a relatively underspecified corpus (sequences of legal Othello moves) imprints itself into the network such that the network ends up with a world model of the data generator (i.e. how the game works in real life).
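For concreteness, here’s a rough sketch of the kind of probing evidence the post describes (the helper producing activations, the layer choice, and the probe shape are all hypothetical placeholders, not the authors’ code): train a small probe to read the board state back out of the network’s internal activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 512                          # residual-stream width (assumed)
probe = nn.Linear(d_model, 64 * 3)     # 64 board squares x {empty, black, white}
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

def probe_step(acts, boards):
    """acts: (batch, d_model) activations taken from some layer/position of the
    trained Othello model (however you extract them);
    boards: (batch, 64) ground-truth square states in {0, 1, 2}."""
    logits = probe(acts).reshape(-1, 64, 3)
    loss = F.cross_entropy(logits.reshape(-1, 3), boards.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# If the probe recovers board state far above chance (and far above a probe
# trained on a randomly initialized network), that's the "world model"
# evidence: the game's rules imprinted a board representation into the
# network, even though the corpus was only sequences of moves.
```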
Does this make sense? I have some sense of having communicated poorly here, but hopefully this is better than leaving your comment unanswered.