What is the y-axis in your plots? Where would 100% accuracy be?
Ah. That’s the number of solved tasks after it is shown the set {length-1 task, length-2 task, … , length-(n+1) task}, where n is the longest task length it has solved so far. So you can think of it roughly as the maximum task length it is able to solve, which means it doesn’t have an upper bound.
I clarified this in the post now. Thanks for catching it.
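Roughly, in code (a simplified sketch of the metric as described above; `solves` is a hypothetical stand-in for the actual per-task evaluation, not the code from the post):

```python
def solves(model, length):
    """Hypothetical: True if the model solves one sampled task of this length."""
    raise NotImplementedError

def solved_count(model, n):
    """Show the probe set {length-1 task, ..., length-(n+1) task}, where n is
    the longest length solved so far, and count how many tasks are solved.
    Because the probe set grows whenever the model solves length n+1,
    the metric has no upper bound."""
    results = {L: solves(model, L) for L in range(1, n + 2)}
    new_n = max((L for L, ok in results.items() if ok), default=n)
    return sum(results.values()), new_n
```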
Thanks. I really like this task!
It’s hard for me to interpret these results without some indication of how good these networks actually are at the task, though. E.g. it is possible that even though a network solved a length=N task once out of however many attempts you made, it just got lucky, or is running some other heuristic that happens to work that one time. I understand why you were interested in how things scale with the length of the problem, given your interest in recurrence and processing depth. But would it be hard to make a plot where the x axis is problem length and the y axis is accuracy or loss?
Yup, here is such a plot, made after training the “switcher” architecture for 350k examples. I remember it was similar for the longer training: the few longest task lengths struggle, but the rest are near 100%.
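For reference, a minimal sketch of how a plot like this can be produced, assuming a hypothetical `evaluate(model, length, n_trials)` that returns accuracy over freshly sampled tasks of a given length (not the actual evaluation code):

```python
import matplotlib.pyplot as plt

def evaluate(model, length, n_trials=100):
    """Hypothetical: fraction of n_trials freshly sampled tasks of the
    given length that the model solves."""
    raise NotImplementedError

def plot_accuracy_by_length(model, max_length=30, n_trials=100):
    lengths = list(range(1, max_length + 1))
    accs = [evaluate(model, L, n_trials) for L in lengths]
    plt.plot(lengths, accs, marker="o")
    plt.xlabel("task length")
    plt.ylabel("accuracy")
    plt.ylim(0, 1.05)
    plt.show()
```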