lewis smith

Karma: 742

lewis smith Jul 21, 2025, 1:24 PM
4 points
2
in reply to: evhub’s comment on: evhub’s Shortform
Strong agree on the mentalistic language. In fact I would go a bit further than saying that work on deception is hard to understand without mentalistic language: I think this is a central point to work on deception / scheming (that the authors of this paper gesture at a little bit): the any definition of strategic deception (e.g Agent A is trying to make agent B believe X, while agent A believes ~X) requires taking the intentional stance and attributing mental states to A and B. I think that it’s reasonable to probe whether attributing these mental states makes sense, and we shouldn’t just uncritically apply the intentional stance. But coming up with experiments that distinguish whether a model is in a given intentional state is quite hard!

lewis smith May 2, 2025, 11:42 AM
1 point
0
in reply to: lewis smith’s comment on: A Problem to Solve Before Building a Deception Detector
although having said that, even simpler probes require some operationalisation of the target state (e.g the model is lying) which is normally behavioural rather than ‘bottom up’ (lying requires believing things, which is an intentional state again.)

lewis smith May 2, 2025, 11:30 AM
1 point
0
in reply to: Joseph Bloom’s comment on: A Problem to Solve Before Building a Deception Detector
i wouldn’t read too much into the title (it’s partly just trying to be punchy), though I do think that the connection between the intentional state of deception and it’s algorithmic representation.

re. point 2.; Possibly this was a bit over-confident. I do think that a priori I think that the simple correspondence theory is unlikely to be right, but maybe I should have more weight on ‘simple correspondence will just hold for deception’.

Another thing is that maybe something I would do differently if we wrote this again is to be a bit more specific about the kind of deception detector; I think I was thinking a lot of a ‘circuit level’ or ‘representational’ version of mechanistic interpretability here (e.g working on finding the deception circuit or the deception representation). I think this is sometimes gestured at (e.g the idea that we need a high level of confidence in the internals of the model in order to make progress on deception).

i’m not sure that, for example, a supervised probe for some particular undesirable behaviour needs you to solve the correspondence problem (which might well count as a ‘deception detector’).

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

lewis smith, Senthooran Rajamanoharan, Arthur Conmy, CallumMcDougall, Tom Lieberum, János Kramár, Rohin Shah and Neel Nanda

Mar 26, 2025, 7:07 PM

113 points

15 comments29 min readLW link

(deepmindsafetyresearch.medium.com)

lewis smith Feb 11, 2025, 5:32 PM
4 points
0
in reply to: lewis smith’s comment on: A Problem to Solve Before Building a Deception Detector
I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
and to expand on this a little bit more: it seems important that we hedge against this possibility by at least spending a bit of time thinking about plans that don’t rhyme with ‘I sure hope everything turns out to be a simple correspondence’! I think eleni and i feel that this is a suprisingly widespread move in interpretability plans, which is maybe why some of the post is quite forceful in arguing against it

lewis smith Feb 11, 2025, 5:06 PM
4 points
0
in reply to: eggsyntax’s comment on: A Problem to Solve Before Building a Deception Detector
I think this is along the right sort of lines. Indeed I think this plan is the sort of thing I hoped to prompt people to think about with the post. But I think there are a few things wrong with it:
- i think premise 1 is big if true, but I think I doubt that it is at easy as this: see the deepmind fact-finding sequence for some counter-evidence. It’s also easy to imagine this being true for some categories of static facts about the external world (e.g paris being in france) but you need to be careful about extending this to the category of all propositional statements (e.g the model thinks that this safeguard is adequate, or the model can’t find any security flaws in this program).
- relatedly, your second bullet point assumes that you can identify the ‘fact’ related to what the model is currently outputing unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
- I think that detecting/preventing models from knowingly lying would be a good research direction and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case when I’m bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false or whatever).
but yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!

lewis smith Feb 11, 2025, 11:00 AM
4 points
0
in reply to: eggsyntax’s comment on: A Problem to Solve Before Building a Deception Detector
I don’t think we actually disagree very much?
I think that it’s totally possible that there do turn out to be convenient ‘simple correspondences’ for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
re.
Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there’s still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state.
This seems like a restatement of what I would consider an important takeaway from this post; that this sort of emergence is at least a conceptual possibility. I think if this is true, it is a category mistake to think about the intentional states as being implemented by a part or a circuit in the model; they are just implemented by the model as a whole.
I don’t think that a takeaway from our argument here is that you necessarily need to have like a complete account of how intentional states emerge from algorithmic ones (e.g see point 4. in the conclusion). I think our idea is more to point out that this conceptual distinction between intentional and algorithmic states is important to make, and that it’s an important thing to think about looking for empirically. See also conclusion/suggestion 2: we aren’t arguing that interpretability work is hopeless, we are trying to point it at the problems that matter for building a deception detector, and give you some tools for evaluating existing or planned research on that basis.

A Problem to Solve Before Building a Deception Detector

Eleni Angelou and lewis smith

Feb 7, 2025, 7:35 PM

76 points

12 comments14 min readLW link

lewis smith Jan 29, 2025, 3:50 PM
3 points
2
in reply to: eggsyntax’s comment on: eggsyntax’s Shortform
i do agree with that, although ‘step 1 is identify the problem’

lewis smith Jan 29, 2025, 12:27 PM
4 points
0
in reply to: Lucius Bushnaq’s comment on: eggsyntax’s Shortform
I think I agree that there are significant quibbles you can raise with the picture chalmers outlines, but in general I think he’s pointing at an important problem for interpretability; that it’s not clear what the relationship between a circuit-level algorithmic understanding and the kind of statements we would like to rule out (e.g this system is scheming against me) is.

lewis smith Jan 23, 2025, 2:32 PM
1 point
0
in reply to: Brent’s comment on: Sherlockian Abduction Master List
re. the article saying it’s hard to observe; I think the short nails are pretty hard to spot (many people keep their nails short) but the long fingerstyle nails are quite unusual looking, though also not that common.

lewis smith Jan 8, 2025, 10:14 AM
1 point
0
in reply to: SL’s comment on: The ‘strong’ feature hypothesis could be wrong
well spotted

lewis smith Dec 22, 2024, 12:42 AM
2 points
1
in reply to: leogao’s comment on: leogao’s Shortform
i mean i think that its’ definitely an update (anything short of 95% i think would have been quite surprising to me)

lewis smith Dec 19, 2024, 9:24 AM
4 points
−12
in reply to: leogao’s comment on: leogao’s Shortform
not to be ‘i trust my priors more than your data’, but i have to say that i find the AGI thing quite implausible; my impression is that most AI researchers (way more than 60%), even ones working in like something very non-deep learning adjacent, have heard of the term AGI, but many of them are/were quite dismissive of it as an idea or associate it strongly (not entirely unfairly) with hype /bullshit, hence maybe walking away from you when you ask them about it.

e.g deepmind and openAI have been massive producers of neurips papers for years now (at least since I started a phd in 2016), and both organisations explictly talked about AGI fairly often for years.

maybe neurips has way more random attendees now (i didn’t go this year), but I still find this kind of hard to believe; I think I’ve read about AGI in the financial times now.

lewis smith Dec 3, 2024, 2:22 PM
1 point
0
in reply to: Logan Zoellner’s comment on: How should TurnTrout handle his DeepMind equity situation?
your example agreement with a friend is obviously a derivative, which is just a contract whose value depends on the value of an underlying asset (google stock in this case). If it’s not a formal derivative contract you might be less likely to get in trouble for it compared to doing it on robinhood or whatever (not legal advice!) but it doesn’t seem like a very good idea.

lewis smith Dec 2, 2024, 4:13 PM
5 points
0
in reply to: Logan Zoellner’s comment on: How should TurnTrout handle his DeepMind equity situation?
like at many public companies, google has anti-insider trading policies that prohibit employees from trading in options and other derivatives on the company stock, or shorting it.

lewis smith Oct 15, 2024, 3:01 PM
1 point
0
in reply to: jake_mendel’s comment on: Circuits in Superposition: Compressing many small neural networks into one
yeah that makes sense I think

lewis smith Oct 15, 2024, 1:51 PM
1 point
0
on: Circuits in Superposition: Compressing many small neural networks into one
with later small networks taking the outputs of earlier small networks as their inputs.
what’s the distinction between two small networks connected in series with the first taking the output of the previous one as input and one big network? what defines the boundaries of the networks here?

lewis smith Sep 15, 2024, 1:22 PM
5 points
2
in reply to: AnthonyC’s comment on: The ‘strong’ feature hypothesis could be wrong
I kind of agree that Dennett is right about this, but I think it’s important to notice that the idea he’s attacking—that all representation is explicit representation—is an old and popular one in philosophy of mind that was, at one point, seen as natural and inevitable by many people working in the field, and one which I think still seems somewhat natural and obvious to many people who maybe haven’t thought about the counterarguments much (e.g I think you can see echos of this view in a post like this one, or the idea that there will be some ‘intelligence algorithm’ which will be a relatively short python program). The idea that a thought is always or mostly something like a sentence in ‘mentalese’ is, I think, still an attractive one to many people of a logical sort of bent, as is the idea that formalised reasoning captures the ‘core’ of cognition.

lewis smith Sep 14, 2024, 9:41 PM
1 point
0
in reply to: AnthonyC’s comment on: The ‘strong’ feature hypothesis could be wrong
I guess you are thinking about holes with the p-type semiconductor?

I don’t think I agree (perhaps obviously) that it’s better to think about the issues in the post in terms of physics analogies than in terms of the philosophy of mind and language. If you are thinking about how a mental representation represents some linguistic concept, then Dennett and Wittgenstein (and others!) are addressing the same problem as you! in a way that virtual particles are really not

lewis smith

Nega­tive Re­sults for SAEs On Down­stream Tasks and Depri­ori­tis­ing SAE Re­search (GDM Mech In­terp Team Progress Up­date #2)

A Prob­lem to Solve Be­fore Build­ing a De­cep­tion Detector

Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

A Problem to Solve Before Building a Deception Detector