AI safety & alignment researcher
eggsyntax
I think premise 1 is big if true, but I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence.
I haven’t read that sequence, I’ll check it out, thanks. I’m thinking of work like the ROME paper from David Bau’s lab, which suggests that fact storage can be identified and edited, and various papers like this one from Mor Geva+ that find evidence that the MLP layers in LLMs are largely key-value stores.
Relatedly, your second bullet point assumes that you can identify the ‘fact’ related to what the model is currently outputting unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
It does seem like a naive approach would require pre-identifying all facts you wanted to track. On the other hand, I can imagine an approach like analyzing the output for factual claims and then searching for those in the record of activations during the output. Not sure, seems very TBD.
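To make the on-the-fly version slightly more concrete, here’s a minimal sketch of what the second half might look like, assuming you already have some way of extracting factual claims from the output and a linear truth probe trained separately; every name here is hypothetical, and the claim-extraction step is the part I’m waving my hands at:

```python
import numpy as np

def flag_suspect_claims(claims, claim_activations, probe_w, probe_b, threshold=0.5):
    """Hypothetical sketch: for each factual claim extracted from the model's output,
    apply a pre-trained linear 'truth' probe to the residual-stream activation cached
    at the token where the claim was asserted, and flag claims the model asserted
    as true but internally scored as false.

    claims:            list of (claim_text, asserted_as_true) pairs from some extractor
    claim_activations: array of shape (n_claims, d_model), cached during generation
    probe_w, probe_b:  linear probe trained on known true/false statements
    """
    flagged = []
    for (text, asserted_true), act in zip(claims, claim_activations):
        p_true = 1.0 / (1.0 + np.exp(-(act @ probe_w + probe_b)))  # probe's P(claim is true)
        if asserted_true and p_true < threshold:
            flagged.append((text, float(p_true)))
    return flagged
```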
I think that detecting/preventing models from knowingly lying would be a good research direction and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case when I’m bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false or whatever).
Great point! I can certainly imagine that there could be cases like that, although I can equally imagine that LLMs could be consistently tracking the truth value of claims even if that isn’t a big factor determining the output.
But yeah, I think this is a reasonable sort of thing to try, though you’d need to do a lot of work to convince me of premise 1; indeed, I doubt premise 1 is true a priori, though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim from it being true of every fact!
That seems reasonable. I’ve mostly had the impression that 1 has generally been true in specific cases where researchers have looked for it, but it’s definitely not something I’ve specifically gone looking for. I’ll be interested to read the sequence from DeepMind.
Also The Chameleon (would have included it in the last comment but had to consult a kid first).
I think that it’s totally possible that there do turn out to be convenient ‘simple correspondences’ for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
Got it. I certainly agree with everything you’re saying in this section of your response. I do think that some of the language in the post suggests that you’re making the stronger claim I was responding to (eg ‘there’s likely a problem you need to solve first’, ‘it is likely an important prerequisite to building a deception detector’, ‘it might even be a category mistake’).
On a slightly different note, here’s another potential critique that I’d be interested in hearing your and/or Eleni’s take on (to be clear, I wouldn’t currently fully endorse all these claims without hedging, but I do think they seem plausible):
Although some intentional states may have complex correspondences with algorithmic states, evidence suggests that the intentional state ‘The model believes fact X’ (eg ‘The model believes that Paris is the capital of France’) is a simple correspondence (modulo some redundancy from training with dropout).
As a result, we’ll likely be able to build a detector for cases where the model believes fact X but outputs ~X.
While that may not be enough to detect the full story of strategic deception, it’s enough to capture all the cases where the model is doing something that can be roughly described as ‘intentionally lying’ (as opposed to error, hallucination, etc).
‘Intentional lying’ is a superset of strategic deception, and we generally don’t want the model to intentionally lie at all, so if we can catch all such cases, then as long as we don’t find any we can be confident that the model isn’t being strategically deceptive.
Where if anywhere would you say that argument is going wrong?
I think this is valuable work, especially the decomposition of capabilities needed for deception, but I’d also like to push back a bit.
I worry about the perfect being the enemy of the good here. There are a number of papers showing that we can at least sometimes use interpretability tools to detect cases where the model believes one thing but says something different. One interesting recent paper (Interpretability Of LLM Deception: Universal Motif) shows that internal evaluation of the actual truth of a statement is handled separately from the decision about whether to lie about it. Of course we can’t be certain at this point that this approach would hold for all cases of deception (especially deep deception), but it’s still potentially useful in practice.
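(Not the paper’s actual code, but to illustrate the kind of check I mean: if you’ve cached activations for the same statements under an ‘answer honestly’ prompt and an ‘answer deceptively’ prompt, and trained a linear truth probe, you can ask whether the internal truth signal stays put while the output flips. Names and setup here are assumed, not taken from the paper.)

```python
import numpy as np

def truth_probe_score(acts, w, b):
    """Logistic truth probe: P(statement is true) from residual-stream activations."""
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))

def truth_signal_gap(acts_honest, acts_lying, w, b):
    """Compare the probe's reading on the same statements under honest vs.
    instructed-to-lie prompts. If truth evaluation is computed separately from
    the decision to lie, the scores should barely move even though the model's
    stated answers flip.

    acts_honest, acts_lying: (n_statements, d_model) activations at the answer token
    """
    gap = np.abs(truth_probe_score(acts_honest, w, b) - truth_probe_score(acts_lying, w, b))
    return float(gap.mean())  # small mean gap => truth signal preserved while lying
```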
For example, this seems significantly too strong:
it might even be a category mistake to be searching for an algorithmic analog of intentional states.
There are useful representations in the internals of at least some intentional states, eg refusal (as you mention), even if that proves not to be true for all intentional states we care about. Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there’s still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state—certainly at that point it’s no longer practically useful, but any improvement over that extreme has value, even being able to say that the intentional state is implemented in one half of the model rather than the other.
I think you’re entirely right that there’s considerable remaining work before we can provide a universal account connecting all intentional states to algorithmic representations. But I disagree that that work has to be done first; we can make important forward progress on particular intentional states even in the absence of such a general account.
Again, I think the work is valuable, and the critique should be taken seriously, but I think its current version is too strong.
Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Are there particular sources, eg Twitter accounts, that you would recommend following? For other readers (I know Daniel already knows this one), the #papers-running-list channel on the AI Alignment Slack is a really good ongoing curation of AIS papers.
One source I’ve recently added and recommend is subscribing to individual authors on Semantic Scholar (eg here’s an author page).
Spyfall is a party game with an interestingly similar mechanic, might have some interesting suggestions.
Perplexity—is this better than Deep Research for lit reviews?
I periodically try both Perplexity and Elicit, and neither has worked very well for me as yet.
Grok—what do people use this for?
Cases where you really want to avoid left-leaning bias or you want it to generate images that other services flag as inappropriate, I guess?
Otter.ai: Transcribing calls / chats
I’ve found read.ai much better than otter and other services I’ve tried, especially on transcription accuracy, with the caveats that a) I haven’t tried others in a year, and b) read.ai is annoyingly pricy (but does have decent export when/if you decide to ditch it).
What models are you comparing to, though? For o1/o3 you’re just getting a summary, so I’d expect those to be more structured/understandable whether or not the raw reasoning is.
Yeah, well put.
I could never understand their resistance to caring about wild animal suffering, a resistance which seems relatively common.
At a guess, many people have the intuition that we have greater moral responsibility for lives that we brought into being (ie farmed animals)? To me this seems partly reasonable and partly like the Copenhagen interpretation of ethics (which I disagree with).
Tentative pre-coffee thought: it’s often been considered really valuable to be ‘T-shaped’: to have at least shallow knowledge of a broad range of areas (either areas in general, or sub-areas of some particular domain), while simultaneously having very deep knowledge in one area or sub-area. One plausible near-term consequence of LLM-ish AI is that the ‘broad’ part of that becomes less important, because you can count on AI to fill you in on the fly wherever you need it.
Possible counterargument: maybe broad knowledge is just as valuable, although it can be even shallower; if you don’t even know that there’s something relevant to know, that there’s a there there, then you don’t know that it would be useful to get the AI to fill you in on it.
with a bunch of reflexes to eg stop and say “that doesn’t sound right” or “I think I’ve gone wrong, let’s backtrack and try another path”
Shannon Sands says he’s found a backtracking vector in R1: https://x.com/chrisbarber/status/1885047105741611507
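I don’t know the details of how he extracted it, but the standard recipe for this kind of thing is a difference-of-means steering vector over contrasting activations; a rough sketch (hypothetical names, assuming you’ve cached residual-stream activations at backtracking vs. non-backtracking tokens):

```python
import numpy as np

def backtracking_vector(acts_backtrack, acts_other):
    """Difference-of-means construction: mean activation at tokens where the model
    backtracks ('wait, that doesn't seem right...') minus mean activation at
    comparable non-backtracking tokens, unit-normalized.

    acts_backtrack, acts_other: (n_samples, d_model) residual-stream activations
    """
    v = acts_backtrack.mean(axis=0) - acts_other.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(resid, v, alpha=4.0):
    """Add the direction to the residual stream during generation to (hopefully)
    encourage 'let me reconsider' behaviour; alpha sets the steering strength."""
    return resid + alpha * v
```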
I’d have to look back at the methodology to be sure, but on the assumption that they have the model answer immediately without any chain of thought, my default guess about this is that it’s about the limits of what can be done in one or a small number of forward passes. If it’s the case that the model is doing some sort of internal simulation of its own behavior on the task, that seems like it might require more steps of computation than just a couple of forward passes allow. Intuitively, at least, this sort of internal simulation is what I imagine is happening when humans do introspection on hypothetical situations.
If on the other hand the model is using some other approach, maybe circuitry developed for this sort of purpose, then I would expect that approach to only handle pretty simple problems, since it has to be much smaller than the circuitry developed for actually handling a very wide range of tasks, ie the rest of the model.
I agree that from a functional perspective, we can interact with an LLM in the same way as we would another human. At the same time I’m pretty sure we used to have good reasons for maintaining a conceptual distinction.
I think of this through the lens of Daniel Dennett’s intentional stance; it’s a frame that we can adopt without making any claims about the fundamental nature of the LLM, one which has both upsides and downsides. I do think it’s important to be careful to stay aware that that’s what we’re doing in order to avoid sloppy thinking.
Nate Soares’ related framing as a ‘behaviorist sense’ is also useful to me:
If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I’ll say it “wants” that outcome “in the behaviorist sense”.
therefore longer contexts can elicit much richer classes of behaviour.
Up to and including Turing-completeness (‘Ask, and it shall be given’)
I think there’s starting to be evidence that models are capable of something like actual introspection, notably ‘Tell me about yourself: LLMs are aware of their learned behaviors’ and (to a more debatable extent) ‘Looking Inward: Language Models Can Learn About Themselves by Introspection’. That doesn’t necessarily mean that it’s what’s happening here, but I think it means we should at least consider it possible.
For chatting, I’ve been quite happy with https://msty.app/, which I just connect to OpenRouter.
Well, we (humans) categorize our epistemic state largely in propositional terms, e.g. in beliefs and suppositions.
I’m not too confident of this. It seems to me that a lot of human cognition isn’t particularly propositional, even if nearly all of it could in principle be translated into that language. For example, I think a lot of cognition is sensory awareness, or imagery, or internal dialogue. We could contort most of that into propositions and propositional attitudes (eg ‘I am experiencing a sensation of pain in my big toe’, ‘I am imagining a picnic table’), but that doesn’t particularly seem like the natural lens to view those through.
That said, I do agree that propositions and propositional attitudes would be a more useful language to interpret LLMs through than eg activation vectors of float values.
Sure, I agree that would be useful.
Can you clarify what you mean by ‘neural analog’ / ‘single neural analog’? Is that meant as another term for what the post calls ‘simple correspondences’?
Agreed. I’m hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I’m skeptical that that’ll happen. Or alternatively, I’m hopeful that we turn out to be in an easy-mode world where there is something like a single ‘deception’ direction that we can monitor, and that’ll at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization pressure / selection pressure if nothing else).
I agree that that’s a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation wasn’t necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.