I’m having some trouble replicating this result in a not-exactly-comparable setting (an internal model, probing is_alice xor amazon_sentiment). I get 90%+ accuracy on the constituent datasets, but only up to 75% on the xor, depending on which layer I probe.
(low confidence, will update as I get more results)
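For reference, by “depending on which layer” I just mean a standard linear probe swept over layers; a minimal sketch of that sweep is below (the featurization and data loading are placeholders, not my internal setup):

```python
from sklearn.linear_model import LogisticRegression

def probe_accuracy_by_layer(acts_train, y_train, acts_test, y_test):
    """acts_*: dict of layer index -> (n_examples, d_model) activation matrix,
    e.g. last-token residual-stream activations at that layer."""
    accs = {}
    for layer, X in acts_train.items():
        # One linear probe per layer, evaluated on held-out activations.
        clf = LogisticRegression(max_iter=1000).fit(X, y_train)
        accs[layer] = clf.score(acts_test[layer], y_test)
    return accs
```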
Thanks for doing this! Can you share the dataset that you’re working with? I’m traveling right now, but when I get a chance I might try to replicate your failed replication on LLaMA-2-13B and with my codebase (which can be found here; see especially xor_probing.ipynb).
The training set is a random 100k subsample of this dataset: https://huggingface.co/datasets/amazon_polarity
I’m prepending Alice/Bob and xor-ing the label in exactly the same way you do.
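For concreteness, here’s a minimal sketch of that construction (the 100k subsample and the xor label are as described above; the exact prompt format and the 50/50 Alice/Bob split are assumptions on my part):

```python
import random
from datasets import load_dataset

random.seed(0)

# Random 100k subsample of amazon_polarity for training.
ds = load_dataset("amazon_polarity", split="train").shuffle(seed=0).select(range(100_000))

def make_xor_example(row):
    # Assumption: the speaker name is prepended directly to the review text.
    is_alice = random.random() < 0.5
    name = "Alice" if is_alice else "Bob"
    # xor the speaker bit with the sentiment label (1 = positive).
    return {
        "text": f"{name}: {row['content']}",
        "is_alice": int(is_alice),
        "sentiment": row["label"],
        "xor_label": int(is_alice) ^ row["label"],
    }

xor_ds = ds.map(make_xor_example)
```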