Sam Marks comments on What’s up with LLMs representing XORs of arbitrary features?

Sam Marks 5 Jan 2024 10:23 UTC
If anyone would like to replicate these results, the code can be found in the rax branch of my geometry-of-truth repo. This was adapted from a codebase I used on a different project, so there’s a lot of unneeded stuff in this repo. The important parts here are:
- The datasets: cities_alice.csv and neg_cities_alice.csv (for the main experiments I describe), cities_distractor.csv and neg_cities_distractor.csv (for the experiments with banana/shed at the end of factual statements), and xor.csv (for the experiments with true/false and banana/shed after random text).
- xor_probing.ipynb: my code for doing the probing and making the plots. This assumes that the activations have already been extracted and saved using generate_acts.py (see the readme for info about how to use generate_acts.py).
Unless you want to do PCA visualizations, I’d probably recommend just taking my datasets and quickly writing your own code to do the probing experiments, rather than spending time trying to figure out my infrastructure here.
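For anyone going the write-your-own route, here is a rough sketch of what a minimal linear probing experiment could look like. To be clear about what is assumed: this is not the code from xor_probing.ipynb, and it does not read the repo's saved activations or CSVs. The "activations" are synthetic vectors with directions for a, b, and a XOR b planted by construction, so the high probe accuracy here is by design and says nothing about real models. All function names (make_synthetic_acts, train_probe, accuracy) are hypothetical, and the probe is plain logistic regression trained by gradient descent using only the standard library.

```python
import math
import random


def make_synthetic_acts(n, dim=8, noise=0.3, seed=0):
    """Build toy 'activations' (ASSUMPTION: synthetic stand-in, not real model data).

    Coordinates 0, 1, 2 carry a, b, and a XOR b as +/-1 signals; every
    coordinate also gets Gaussian noise.
    """
    rng = random.Random(seed)
    acts, labels = [], []
    for _ in range(n):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        x = a ^ b
        v = [rng.gauss(0.0, noise) for _ in range(dim)]
        v[0] += 2 * a - 1  # planted direction for feature a
        v[1] += 2 * b - 1  # planted direction for feature b
        v[2] += 2 * x - 1  # planted direction for a XOR b
        acts.append(v)
        labels.append({"a": a, "b": b, "a_xor_b": x})
    return acts, labels


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(X, y, epochs=200, lr=0.5):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    n, dim = len(X), len(X[0])
    w, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + bias) - yi
            for j in range(dim):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        bias -= lr * grad_b / n
    return w, bias


def accuracy(w, bias, X, y):
    """Fraction of examples where the probe's thresholded output matches y."""
    correct = 0
    for xi, yi in zip(X, y):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, xi)) + bias > 0 else 0
        correct += pred == yi
    return correct / len(X)


if __name__ == "__main__":
    # Separate seeds give disjoint train and held-out samples.
    X_train, y_train = make_synthetic_acts(400, seed=0)
    X_test, y_test = make_synthetic_acts(200, seed=1)
    for name in ("a", "b", "a_xor_b"):
        w, bias = train_probe(X_train, [lab[name] for lab in y_train])
        acc = accuracy(w, bias, X_test, [lab[name] for lab in y_test])
        print(f"probe for {name}: held-out accuracy = {acc:.3f}")
```

To run this on real data you would swap make_synthetic_acts for code that loads the activations saved by generate_acts.py alongside the labels from the relevant CSV, and in practice you would probably use a library implementation of logistic regression rather than this hand-rolled loop.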