Clément Dumas comments on What’s up with LLMs representing XORs of arbitrary features?

Clément Dumas 14 Jan 2024 0:58 UTC
1 point
Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?
- Hoagy 14 Jan 2024 8:02 UTC
  1 point
  Yeah, I’d expect some degree of interference, leading to >50% success on XORs even in small models.
  - Clément Dumas 16 Jan 2024 10:34 UTC
    1 point
    You can get ~75% just by computing the OR. But we found that Pythia-70m only does better than 75% at the last layer and at training step 16000; see this video.
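
For readers wondering where the ~75% baseline in the last reply comes from, here is a minimal sketch. It assumes the two features are independent and each true half the time, which the thread does not state; under that assumption OR agrees with XOR on three of the four input pairs. The helper name `xor_accuracy` is made up for illustration.

```python
from itertools import product

def xor_accuracy(predict):
    """Fraction of the four (a, b) input pairs where predict(a, b) equals a XOR b.

    Assumes all four pairs are equally likely (independent, balanced features).
    """
    pairs = list(product([False, True], repeat=2))
    hits = sum(predict(a, b) == (a != b) for a, b in pairs)
    return hits / len(pairs)

# OR disagrees with XOR only when both inputs are true.
or_acc = xor_accuracy(lambda a, b: a or b)

# A constant guess is the chance baseline.
const_acc = xor_accuracy(lambda a, b: False)

print(or_acc, const_acc)
```

So under this assumption a probe has to beat 75%, not just 50%, before its accuracy is evidence of something beyond an OR-like direction.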