I do think linear probes are useful, and if you can correctly classify the target with a linear probe it makes it more likely that the model is potentially “representing something interesting” internally (e.g. the solution to the knapsack problem). But its not guaranteed, the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the deepmind paper I referenced. Fabien Roger also explains some shortcoming with CCS here. The takeaway is just, be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
I (briefly) looked at the DeepMind paper you linked and Roger’s post on CCS. I’m not sure if I’m missing something, but these don’t really update me much on the interpretation of linear probes in the setup I described.
One of the main insights I got out of those posts is “unsupervised probes likely don’t retrieve the feature you wanted to retrieve” (and adding some additional constraints on the probes doesn’t solve this). This… doesn’t seem that surprising to me? And more importantly, it seems quite unrelated to the thing I’m describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I’m claiming
“If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem.”
An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n—even though I agree that the model might not be literally thinking about p (mod 3).
And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude “the model is definitely doing a lot of work to solve this problem” (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).
I do think linear probes are useful, and if you can correctly classify the target with a linear probe it makes it more likely that the model is potentially “representing something interesting” internally (e.g. the solution to the knapsack problem). But its not guaranteed, the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the deepmind paper I referenced. Fabien Roger also explains some shortcoming with CCS here. The takeaway is just, be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
I (briefly) looked at the DeepMind paper you linked and Roger’s post on CCS. I’m not sure if I’m missing something, but these don’t really update me much on the interpretation of linear probes in the setup I described.
One of the main insights I got out of those posts is “unsupervised probes likely don’t retrieve the feature you wanted to retrieve” (and adding some additional constraints on the probes doesn’t solve this). This… doesn’t seem that surprising to me? And more importantly, it seems quite unrelated to the thing I’m describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I’m claiming
“If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem.”
An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n—even though I agree that the model might not be literally thinking about p (mod 3).
And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude “the model is definitely doing a lot of work to solve this problem” (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).