Jérémy Scheurer comments on Hidden Cognition Detection Methods and Benchmarks

Jérémy Scheurer Feb 29, 2024, 8:21 AM
1 point
0
I do think linear probes are useful, and if you can correctly classify the target with a linear probe it makes it more likely that the model is potentially “representing something interesting” internally (e.g. the solution to the knapsack problem). But its not guaranteed, the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the deepmind paper I referenced. Fabien Roger also explains some shortcoming with CCS here. The takeaway is just, be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
- Olli Järviniemi Mar 20, 2024, 8:09 AM
  2 points
  0
  Parent
  I (briefly) looked at the DeepMind paper you linked and Roger’s post on CCS. I’m not sure if I’m missing something, but these don’t really update me much on the interpretation of linear probes in the setup I described.
  One of the main insights I got out of those posts is “unsupervised probes likely don’t retrieve the feature you wanted to retrieve” (and adding some additional constraints on the probes doesn’t solve this). This… doesn’t seem that surprising to me? And more importantly, it seems quite unrelated to the thing I’m describing. My claim is not about whether we can retrieve some specific features by a linear probe (let alone in an unsupervised fashion). Rather I’m claiming
  “If we feed the model a hard computational problem, and our linear probe consistently retrieves the solution to the problem, then the model is internally performing (almost all) computation to solve the problem.”
  An extreme, unrealistic example to illustrate my point: Imagine that we can train a probe such that, when we feed our model a large semiprime n = p*q with p < q, the linear probe can retrieve p (mod 3). Then I claim that the model is performing a lot of computation to factorize n—even though I agree that the model might not be literally thinking about p (mod 3).
  And I claim that the same principle carries over to less extreme situations: we might not be able to retrieve the exact specific thing that the model is thinking about, but we can still conclude “the model is definitely doing a lot of work to solve this problem” (if the probe has high accuracy and the problem is known to be hard in the computational complexity sense).

Keyboard shortcuts

Keys shown in yellow (e.g., ]) are accesskeys, and require a browser-specific modifier key (or keys).

Keys shown in grey (e.g., ?) do not require any modifier keys.

General
? Show keyboard shortcuts
Esc Hide keyboard shortcuts

Site navigation
h Go to Home (a.k.a. “Frontpage”) view
f Go to Featured (a.k.a. “Curated”) view
a Go to All (a.k.a. “Community”) view
m Go to Meta view
v Go to Tags view
c Go to Recent Comments view
r Go to Archive view
q Go to Sequences view
t Go to About page
u Go to User or Login page
o Go to Inbox page

Page navigation
, Jump up to top of page
. Jump down to bottom of page
/ Jump to top of comments section
s Search

Page actions
n New post or comment
e Edit current post

Post/comment list views
. Focus next entry in list
, Focus previous entry in list
; Cycle between links in focused entry
Enter Go to currently focused entry
Esc Unfocus currently focused entry
] Go to next page
[ Go to previous page
\ Go to first page
e Edit currently focused post

Editor
k Bold text
i Italic text
l Insert hyperlink
q Blockquote text

Appearance
= Increase text size
- Decrease text size
0 Reset to default text size
′ Cycle through content width settings
1 Switch to default theme [A]
2 Switch to dark theme [B]
3 Switch to grey theme [C]
4 Switch to ultramodern theme [D]
5 Switch to simple theme [E]
6 Switch to brutalist theme [F]
7 Switch to ReadTheSequences theme [G]
8 Switch to classic Less Wrong theme [H]
9 Switch to modern Less Wrong theme [I]
; Open theme tweaker
Enter Save changes and close theme tweaker
Esc Close theme tweaker (without saving)

Slide shows
l Start/resume slideshow
Esc Exit slideshow
→↓ Next slide
←↑ Previous slide
Space Reset slide zoom

Miscellaneous
x Switch to next view on user page
z Switch to previous view on user page
` Toggle compact comment list view
g Toggle anti-kibitzer