I think I’m missing something. What does the story look like where we have some feature we’re totally unsure what it signifies, but we’re very sure the model is using it?
Or from the other direction, I keep coming back to Jacob’s transformer with like 200 orthogonal activation directions that all seem to make the model write good code. They all appeared to produce nearly the exact same activation pattern 8 layers on. It didn’t seem like his model was particularly spoiled for activation space, so what was it all those extra directions were actually picking up on?
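For what it’s worth, the “same activation pattern 8 layers on” observation is the kind of thing you can check mechanically: steer along each direction, grab the downstream activations, and compute pairwise cosine similarities. Here’s a toy sketch of that check, with a hypothetical rank-1 `downstream` map standing in for the real 8-layers-later computation (the names `dirs`, `downstream`, and the dimensions are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64       # toy activation dimensionality
n_dirs = 8   # stand-in for the ~200 orthogonal directions

# Orthonormal steering directions (toy stand-in for the real ones).
dirs = np.linalg.qr(rng.standard_normal((d, n_dirs)))[0].T

# Hypothetical "8 layers downstream" map. A rank-1 linear map here,
# so every input direction collapses onto one output direction,
# mimicking the kind of convergence described above.
u = rng.standard_normal(d)
w = rng.standard_normal(d)
downstream = lambda v: np.outer(u, w) @ v

def pairwise_cosine(acts):
    """Pairwise cosine similarity between rows of `acts`."""
    normed = acts / np.linalg.norm(acts, axis=1, keepdims=True)
    return normed @ normed.T

acts = np.stack([downstream(v) for v in dirs])
sims = np.abs(pairwise_cosine(acts))
print(sims.min())  # near 1.0: all downstream activations aligned
```

If the real model behaved like this toy, a min pairwise similarity near 1 would say the extra directions are distinctions the model makes upstream but throws away downstream, which is one candidate answer to the question.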