Neel Nanda comments on Actually, Othello-GPT Has A Linear Emergent World Representation

Neel Nanda 19 Apr 2023 20:54 UTC
LW: 4 AF: 3
Thanks! I also feel more optimistic now about speed research :) (I’ve tried similar experiments since, but with much less success—there’s a bunch of contingent factors around not properly hitting flow and not properly clearing time for it though). I’d be excited to hear what happens if you try it! Though I should clarify that writing up the results took a month of random spare non-work time...

Re “models can be deeply understood”: yes, I think you raise a valid and plausible concern, and I agree that my work is not notable evidence against it. Though also, idk man, it seems basically unfalsifiable. My intuition is that there may be some threshold of “we cannot deeply interpret past this”, but no one knows where it is (and most people assumed “we cannot deeply interpret at all”! Or something similar). And every interpretability win is evidence that the boundary is further out (or non-existent).

Fuzzy intuition: It doesn’t distinguish between the boundary being far away vs non-existent, but IMO the correct prior before seeing mech interp work at all was to have some distribution over the point where we hit a wall, and some probability on never hitting a wall. The longer we go without hitting a wall, the higher the posterior probability on never hitting a wall should be.
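To make that update concrete, here is a toy sketch, not a claim about the real numbers. It assumes that if a wall exists, each further interpretability win has a fixed chance `hazard` of running into it (a geometric distribution over where the wall sits). The function name, `prior_never`, and `hazard` are all made up for illustration.

```python
def posterior_never(prior_never: float, hazard: float, wins: int) -> float:
    """Posterior probability that there is no wall, after `wins` successes.

    Toy assumptions (hypothetical, for illustration only):
      - with probability `prior_never` there is no wall at all;
      - otherwise each new win independently hits the wall with
        probability `hazard`.
    """
    # Chance of seeing `wins` successes in a row if a wall does exist.
    survive_if_wall = (1.0 - hazard) ** wins
    # Bayes' rule: the no-wall hypothesis always predicts survival (likelihood 1).
    return prior_never / (prior_never + (1.0 - prior_never) * survive_if_wall)


if __name__ == "__main__":
    for wins in (0, 5, 10, 20):
        print(wins, posterior_never(prior_never=0.2, hazard=0.1, wins=wins))
```

Under these assumptions the posterior starts at the prior and rises with every win. The same evidence also pushes the expected wall location further out, which is why it can't cleanly separate "far away" from "non-existent".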

Translate by X is bad notation: it means “take the coordinate in the ‘mine vs theirs’ direction and set it to -X times its original value”. It should really be “flip and scale by X” or something (it came from an initial iteration of the method).
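A minimal sketch of what “flip and scale by X” does to a single activation vector, assuming the “mine vs theirs” direction is given as a plain vector (e.g. from a linear probe). The function name and the list-based representation are hypothetical and chosen for readability; this is not the code from the post.

```python
import math


def flip_and_scale(activation, direction, x):
    """Set the coordinate of `activation` along `direction` to -x times
    its original value, leaving every orthogonal component unchanged.

    `activation` and `direction` are equal-length lists of floats
    (a hypothetical stand-in for a residual stream vector and a probe direction).
    """
    # Normalise so the dot product below is a true coordinate.
    norm = math.sqrt(sum(d * d for d in direction))
    unit = [d / norm for d in direction]
    # Current coordinate along the direction.
    coord = sum(a * u for a, u in zip(activation, unit))
    # Move from `coord` to `-x * coord` along the unit direction.
    delta = -x * coord - coord
    return [a + delta * u for a, u in zip(activation, unit)]


if __name__ == "__main__":
    # The coordinate along the first axis is 3.0, so with x = 2 it becomes -6.0.
    print(flip_and_scale([3.0, 4.0], [2.0, 0.0], 2.0))
```

With X = 1 this is a pure flip of that coordinate; larger X flips it and pushes it further in the opposite direction.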