cannot even be reasonably sure that the measurements taken and experiments performed are telling us what we think they are
It’s not clear to me why this follows. Couldn’t it be the case that even without a theory of what sorts of features we expect models to learn / use, we can detect what features they are in fact using?
I guess there are two things here:
1. what sorts of things in the environment we expect models to pick up on
2. how we expect models to process info from the environment
If we’re wrong about 1, I feel like we could find it out. But if we make wrong assumptions about 2, it makes a bit more sense to me that we could fail to find that out.
In any case, an example indicating how we could fail would probably be useful here.
For 1., we could totally find out that our AGI just plain cannot pick up on what a car or a dog is, and only classifies/recognizes their parts (or recognizes them by halves, or just always misclassifies them), but then have no sense of what’s causing that or how to fix it.
For 2. … I have no idea? I feel like that might be out of scope for what I want to think about. I don’t even know how I’d start attacking that problem in full generality or even in part.
I think I’m missing something. What does the story look like where we’re totally unsure what a feature signifies, but we’re very sure that the model is using it?
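For concreteness, here’s roughly what I imagine “very sure the model is using it” would cash out as: find a candidate direction somehow, project it out of the activations, and check whether behavior degrades on real data. The sketch below is a toy PyTorch stand-in (made-up model, random “feature” direction, hypothetical names), not anyone’s actual setup; it’s just meant to show the shape of the evidence.

```python
import torch

torch.manual_seed(0)

# Toy stand-in model: a stack of residual "blocks" plus a readout head.
# This is NOT any real model, just enough structure to show the experiment.
d_model, n_blocks, n_classes = 64, 6, 10
blocks = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_blocks)]
)
readout = torch.nn.Linear(d_model, n_classes)

def run(x, ablate_dir=None, at_block=3):
    """Forward pass; optionally project out one unit direction mid-network."""
    h = x
    for i, block in enumerate(blocks):
        h = h + torch.relu(block(h))
        if ablate_dir is not None and i == at_block:
            # Remove the component of the activations along ablate_dir.
            h = h - (h @ ablate_dir)[:, None] * ablate_dir
    return readout(h)

# Stand-in "evaluation data"; in a real setting this is a held-out set.
x = torch.randn(256, d_model)
y = torch.randint(0, n_classes, (256,))
loss_fn = torch.nn.CrossEntropyLoss()

# A candidate feature direction found somehow (probe, SAE, whatever),
# whose meaning we make no claim to understand.
v = torch.randn(d_model)
v = v / v.norm()

with torch.no_grad():
    base_loss = loss_fn(run(x), y).item()
    ablated_loss = loss_fn(run(x, ablate_dir=v), y).item()

print(f"loss, direction intact:  {base_loss:.3f}")
print(f"loss, direction ablated: {ablated_loss:.3f}")
# If ablating v consistently and substantially hurts the model, that seems
# like strong evidence the model "uses" v, with no story about what v means.
# (The toy model here is untrained, so the numbers themselves are noise.)
```

So maybe that’s the story: an ablation result like this could make us confident a direction matters while leaving us with no account of what it signifies.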
Or from the other direction: I keep coming back to Jacob’s transformer with something like 200 orthogonal activation directions that all appear to make the model write good code. They all seemed to produce roughly the same activation pattern 8 layers on. It didn’t seem like his model was particularly spoiled for activation space, so what is it that all those extra directions were actually picking up on?
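To make that “same activation pattern 8 layers on” check concrete, here’s the kind of comparison I have in mind: steer along each orthogonal direction at an early block, read out activations several blocks later, and compare the induced changes pairwise. Again, this is a toy stand-in in PyTorch with random orthonormal directions, not Jacob’s actual transformer, setup, or numbers.

```python
import torch

torch.manual_seed(0)

# Same kind of toy stand-in as above, deeper so "8 layers on" means something.
d_model, n_blocks = 64, 12
blocks = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_blocks)]
)

def run(x, steer=None, steer_block=2, read_block=10):
    """Forward pass; optionally add a steering vector after one block, and
    return the activations several blocks later."""
    h, read = x, None
    for i, block in enumerate(blocks):
        h = h + torch.relu(block(h))
        if steer is not None and i == steer_block:
            h = h + steer
        if i == read_block:
            read = h
    return read

# A handful of hypothetical orthonormal "makes it write good code" directions
# (random here; in the real case they come out of the analysis).
n_dirs, scale = 8, 4.0
q, _ = torch.linalg.qr(torch.randn(d_model, n_dirs))  # orthonormal columns
dirs = q.T  # shape (n_dirs, d_model)

x = torch.randn(32, d_model)
with torch.no_grad():
    base = run(x).mean(dim=0)
    # The change each steering direction induces in the downstream activations.
    deltas = torch.stack(
        [run(x, steer=scale * d).mean(dim=0) - base for d in dirs]
    )

# Pairwise cosine similarity between those downstream changes.
normed = deltas / deltas.norm(dim=-1, keepdim=True)
print(normed @ normed.T)
# If the off-diagonal similarities were all near 1 (as in the case I'm
# describing), orthogonal upstream directions would be collapsing onto
# roughly one downstream pattern, which is exactly the puzzle: what distinct
# thing was each of them picking up on?
```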