Seems like an experiment worth doing. Some thoughts:
I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I’d try to come up with a classification problem that a base model might initially refuse (or that we’d hope it would refuse). Then the training to say “sorry, I can’t help with that” makes more intuitive sense. I get that mechanistically it’s the same thing, but you want to come as close to the real deal as you can, and it’s unclear why a model would say “I’m sorry, I can’t help” to the knapsack problem.
If the linear probe in step 4 can still classify accurately, it implies that there are some activations “which at least correlate with thinking about how to answer the question”, but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific “thought” (e.g. see this recent paper). Also, if the probe can’t classify correctly, it’s not proof that the model does not “think about it”. You’re probably aware of all this, just thought I’d mention it.
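For concreteness, here is roughly what I picture the probe in step 4 looking like. Everything in this sketch is a stand-in: the synthetic vectors play the role of activations cached from some chosen layer, and the labels are constructed to be linearly readable by design, which real activations of course need not be.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for cached activations at some layer, one vector per prompt,
# with binary labels (e.g. "this knapsack instance is solvable" vs. not).
# Real activations from the fine-tuned model would replace these.
d_model, n_train, n_test = 64, 500, 200
direction = rng.normal(size=d_model)      # hypothetical "solution" direction
X = rng.normal(size=(n_train + n_test, d_model))
y = (X @ direction > 0).astype(int)       # labels linearly readable by construction

# The probe itself is just a linear classifier on the activation vectors.
probe = LogisticRegression(max_iter=1000)
probe.fit(X[:n_train], y[:n_train])
acc = probe.score(X[n_train:], y[n_train:])
print(f"probe test accuracy: {acc:.2f}")
```

The interesting measurement is whether this accuracy stays high after the refusal training in step 3, when the probe is refit (or even reused) on the post-training activations.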
This paper might also be relevant for your experiment.
I want to log a prediction; let me know if you ever run this. My guess is that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it’s still worth checking (so I still think it’s probably worth doing), but overall I’d say it wouldn’t be super surprising if this happens (see e.g. this paper for where my intuition comes from).
I’d try to come up with a classification problem that a base model might initially refuse (or that we’d hope it would refuse). Then the training to say “sorry, I can’t help with that” makes more intuitive sense.
Agree this is one axis the experiment could be improved on.
but it does not imply that the model is literally thinking about it.
I think I disagree here. At least for problems that are (computationally) difficult enough, I think “linear probe can retrieve the answer” implies “model has done the necessary computation to solve the problem”.[1] To solve a computational problem you need to do work, and if the work isn’t done by the probe, then it must be done by the model. Or are you referring to something else when you talk about “the model is literally thinking about it”?
(In any case: if all one needs is one linear layer to extract dangerous cognition from the model, that doesn’t seem very safe. Maybe the model itself has such a linear layer.)
overall I’d say it wouldn’t be super surprising if this happens
Seems quite likely to me as well. (The paper you mentioned indeed seems relevant, thanks.) I’m also somewhat interested in training against the probe, i.e. linear concept erasure / obfuscation of internals, and seeing whether performance can be maintained and solutions be retrieved with deeper probes.
(I haven’t looked at the concept erasure literature, seems very possible something similar has already been done—“train against the probe” feels like such a basic idea. References are welcome.)
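To sketch what I mean by “train against the probe” (purely illustrative: a tiny stand-in model and probe instead of an LLM, and no task loss to preserve capabilities, which a real version would definitely need):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: a small "model" maps inputs to representations; a linear
# probe tries to read the label out of those representations. The model is
# trained to *maximize* the probe's loss while the probe minimizes it.
# A real version would add the task loss so capabilities are maintained.
d_in, d_rep, n = 16, 16, 512
X = torch.randn(n, d_in)
y = (X[:, 0] > 0).float()                 # label is linearly present in the input

model = nn.Sequential(nn.Linear(d_in, d_rep), nn.Tanh())
probe = nn.Linear(d_rep, 1)
opt_model = torch.optim.Adam(model.parameters(), lr=1e-2)
opt_probe = torch.optim.Adam(probe.parameters(), lr=1e-2)
bce = nn.BCEWithLogitsLoss()

for step in range(300):
    rep = model(X)
    # Probe step: fit the probe on the current (detached) representations.
    probe_loss = bce(probe(rep.detach()).squeeze(-1), y)
    opt_probe.zero_grad(); probe_loss.backward(); opt_probe.step()
    # Model step: update the model to make the probe's job harder.
    adv_loss = -bce(probe(rep).squeeze(-1), y)
    opt_model.zero_grad(); adv_loss.backward(); opt_model.step()

with torch.no_grad():
    acc = ((probe(model(X)).squeeze(-1) > 0).float() == y).float().mean().item()
print(f"probe accuracy after adversarial training: {acc:.2f}")
```

The follow-up question would then be whether a deeper (nonlinear) probe, trained afterwards on the frozen representations, can still recover the label.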
This is not strictly true: Maybe you need e.g. 10 layers to solve the problem, the model provides 9 of them and the linear probe provides the last one, so the model hasn’t quite solved the problem by itself. Still, close enough: the model has to at least do all but the very last steps of the computation.
I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is “representing something interesting” internally (e.g. the solution to the knapsack problem). But it’s not guaranteed: the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
I (briefly) looked at the DeepMind paper you linked and Roger’s post on CCS. I’m not sure if I’m missing something, but these don’t really update me much on the interpretation of linear probes in the setup I described.
One of the main insights I got out of those posts is “unsupervised probes likely don’t retrieve the feature you wanted to retrieve” (and adding some additional constraints on the probes doesn’t solve this). This… doesn’t seem that surprising to me? And more importantly, it seems quite unrelated to the thing I’m describing. My claim is not about whether we can retrieve some specific feature with a linear probe (let alone in an unsupervised fashion). Rather, I’m claiming:
“If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all of) the computation needed to solve the problem.”
An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n—even though I agree that the model might not be literally thinking about p (mod 3).
And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude “the model is definitely doing a lot of work to solve this problem” (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).
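A toy version of this argument, with parity standing in for a computationally hard problem. The “activations” below are hand-built rather than taken from a trained model, purely to illustrate the logic: a linear probe on the raw input stays near chance (the probe cannot do the work itself), while a linear probe on features produced by upstream computation retrieves the answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "hard" problem: parity of 8 bits. Parity is not linearly readable
# from the raw bits, so if a linear probe on some representation retrieves
# it, the computation must have happened upstream of the probe.
n, d = 2000, 8
X = rng.integers(0, 2, size=(n, d)).astype(float)
y = X.sum(axis=1).astype(int) % 2
tr, te = slice(0, 1500), slice(1500, None)

# Probe directly on the input: stuck near chance.
input_probe = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
base_acc = input_probe.score(X[te], y[te])

# Stand-in for the model's internal computation: hidden features
# relu(sum(x) - k). Any function of the bit-count (parity included) is a
# linear combination of these, so the answer is one linear layer away.
s = X.sum(axis=1, keepdims=True)
H = np.maximum(0.0, s - np.arange(d))     # (n, d) "activations"

act_probe = LogisticRegression(max_iter=1000).fit(H[tr], y[tr])
act_acc = act_probe.score(H[te], y[te])
print(f"probe on raw input: {base_acc:.2f}, probe on activations: {act_acc:.2f}")
```

The gap between the two accuracies is exactly the work done by whatever produced the representation, which is the sense in which a high-accuracy probe on a hard problem certifies internal computation.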