AI safety & alignment researcher
eggsyntax
Some interesting thoughts on (in)efficient markets from Byrne Hobart, worth considering in the context of Inadequate Equilibria.
(I’ve selected one interesting bit, but there’s more; I recommend reading the whole thing)
When a market anomaly shows up, the worst possible question to ask is “what’s the fastest way for me to exploit this?” Instead, the first thing to do is to steelman it as aggressively as possible, and try to find any way you can to rationalize that such an anomaly would exist. Do stocks rise on Mondays? Well, maybe that means savvy investors have learned through long experience that it’s a good idea to take off risk before the weekend, and even if this approach loses money on average, maybe the one or two Mondays a decade where the market plummets at the open make it a winning strategy because the savvy hedgers are better-positioned to make the right trades within that set.[1] Sometimes, a perceived inefficiency is just measurement error: heavily-shorted stocks reliably underperform the market—until you account for borrow costs (and especially if you account for the fact that if you’re shorting them, there’s a good chance that your shorts will all rally on the same day your longs are underperforming). There’s even meta-efficiency at work in otherwise ridiculous things like gambling on 0DTE options or flipping meme stocks: converting money into fun is a legitimate economic activity, though there are prudent guardrails on it just in case someone finds that getting a steady amount of fun requires burning an excessive number of dollars.
These all flex the notion of efficiency a bit, but it’s important to enumerate them because they illustrate something annoying about the question of market efficiency: the more precisely you specify the definition, and the more carefully you enumerate all of the rational explanations for seemingly irrational activities, the more you’re describing a model of reality so complicated that it’s impossible to say whether it’s 50% or 90% or 1-ε efficient.
Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying “You should see me as less expert in some important areas than you currently do.”
I agree with Daniel here but would add one thing:
what we care about is which one they wear in high-stakes situations where e.g. they have tons of power and autonomy and no one is able to check what they are doing or stop them. (You can perhaps think of this one as the “innermost mask”)
I think there are also valuable questions to be asked about attractors in persona space—what personas does an LLM gravitate to across a wide range of scenarios, and what sorts of personas does it always or never adopt? I’m not aware of much existing research in this direction, but it seems valuable. If, for example, we could demonstrate certain important bounds (‘This LLM will never adopt a mass-murderer persona’), there’s potential alignment value there IMO.
...soon the AI rose and the man died[1]. He went to Heaven. He finally got his chance to discuss this whole situation with God, at which point he exclaimed, “I had faith in you but you didn’t save me, you let me die. I don’t understand why!”
God replied, “I sent you non-agentic LLMs and legible chain of thought, what more did you want?”
and the tokens/activations are all still very local because you’re still early in the forward pass
I don’t understand why this would necessarily be true, since attention heads have access to values for all previous token positions. Certainly, there’s been less computation at each token position in early layers, so I could imagine there being less value to retrieving information from earlier tokens. But on the other hand, I could imagine it sometimes being quite valuable in early layers just to know what tokens had come before.
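To make that concrete, here’s a minimal numpy sketch (arbitrary toy shapes, not any particular model) of causal self-attention: the mask only hides future positions, and it’s the same mask at every layer, so an early-layer head can still read values from any earlier token.

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) residual stream; W_*: (d_model, d_head)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq_len, seq_len)
    mask = np.triu(np.full_like(scores, -1e9), 1)   # hide *future* positions only
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visible positions
    return weights @ v                              # each position mixes values from all earlier tokens

# Toy example standing in for an *early* layer: the mask is identical at layer 0
# and at layer 40, so the last position attends over every earlier position here
# too. What differs early on is only how much computation has gone into each
# position's residual stream, not which positions are reachable.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
x_early = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(causal_self_attention(x_early, W_q, W_k, W_v).shape)  # (8, 4)
```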
For me as an outsider, it still looks like the AI safety movement is only about “how do we prevent AI from killing us?”. I know it’s an oversimplification, but that’s how, I believe, many who don’t really know about AI perceive it.
I don’t think it’s that much of an oversimplification, at least for a lot of AIS folks. Certainly that’s a decent summary of my central view. There are other things I care about—eg not locking in totalitarianism—but they’re pretty secondary to ‘how do we prevent AI from killing us?’. For a while there was an effort in some quarters to rebrand as AINotKillEveryoneism, which I think does a nice job of centering the core issue.
It may as you say be unsexy, but it’s still the thing I care about; I strongly prefer to live, and I strongly prefer for everyone’s children and grandchildren to get to live as well.
We create a small dataset of chat and agentic settings from publicly available benchmarks and datasets.
I believe there are some larger datasets of relatively recent real chat evaluations, eg the LMSYS dataset was most recently updated in July (I’m assuming but haven’t verified that the update added more recent chats).
Can you clarify what you mean by ‘neural analog’ / ‘single neural analog’? Is that meant as another term for what the post calls ‘simple correspondences’?
Even if all the safety-relevant properties have them, there’s no reason to believe (at least for now) that we have the interp tools to find them in time i.e., before having systems fully capable of pulling off a deception plan.
Agreed. I’m hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I’m skeptical that that’ll happen. Or alternately I’m hopeful that we turn out to be in an easy-mode world where there is something like a single ‘deception’ direction that we can monitor, and that’ll at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization pressure / selection pressure if nothing else).
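For illustration, here’s roughly what ‘monitor a single deception direction’ would look like if (big if) such a direction had already been found; the direction, layer, and threshold below are all hypothetical placeholders, and the activations are random stand-ins.

```python
import numpy as np

def flag_suspicious_tokens(resid_acts, deception_dir, threshold=3.0):
    """resid_acts: (n_tokens, d_model) activations cached from the monitored layer.
    deception_dir: (d_model,) unit vector, assumed already found by prior interp work.
    Returns indices of tokens whose projection onto the direction exceeds the
    (made-up) threshold, i.e. tokens a human or another system should inspect."""
    scores = resid_acts @ deception_dir  # per-token projection onto the direction
    return np.where(scores > threshold)[0]

# Toy usage with random stand-ins for real activations and a real direction:
rng = np.random.default_rng(0)
acts = rng.normal(size=(20, 512))
direction = rng.normal(size=512)
direction /= np.linalg.norm(direction)
print(flag_suspicious_tokens(acts, direction))
```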
I’m also worried that claims such as “we can make important forward progress on particular intentional states even in the absence of such a general account.” could further lead to a slippery slope that more or less embraces having the dangerous thing first without sufficient precautions
I agree that that’s a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation wasn’t necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.
I think premise 1 is big if true, but I think I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence.
I haven’t read that sequence; I’ll check it out, thanks. I’m thinking of work like the ROME paper from David Bau’s lab, which suggests that fact storage can be identified and edited, and various papers like this one from Mor Geva+ that find evidence that the MLP layers in LLMs are largely key-value stores.
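To spell out that key-value framing in toy form (this is my restatement of the idea, not code from either paper): each row of the MLP’s input matrix acts as a ‘key’ matched against the residual stream, and the corresponding column of the output matrix is a ‘value’ written back, weighted by how strongly that key fired. A minimal numpy sketch, with all shapes illustrative:

```python
import numpy as np

def mlp_as_kv_store(x, W_in, W_out):
    """x: (d_model,) residual stream; W_in: (d_mlp, d_model); W_out: (d_model, d_mlp).
    Equivalent to the usual W_out @ relu(W_in @ x), just written to emphasize the
    key-value reading: key_strengths says which stored entries fire, and the
    output is a weighted sum of the corresponding 'value' vectors."""
    key_strengths = np.maximum(W_in @ x, 0.0)  # how strongly each 'key' row matches x
    values = W_out.T                           # (d_mlp, d_model): one 'value' per key
    return key_strengths @ values

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64
x = rng.normal(size=d_model)
W_in = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_model, d_mlp))
out = mlp_as_kv_store(x, W_in, W_out)
# Which stored entries contributed most to this output:
print(np.argsort(np.maximum(W_in @ x, 0.0))[-3:])
```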
Relatedly, your second bullet point assumes that you can identify the ‘fact’ related to what the model is currently outputting unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
It does seem like a naive approach would require pre-identifying all facts you wanted to track. On the other hand, I can imagine an approach like analyzing the output for factual claims and then searching for those in the record of activations during the output. Not sure, seems very TBD.
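Very roughly, the on-the-fly version might look something like the sketch below, assuming (and this is exactly the unproven part) that we could get a probe direction for each claim extracted from the output; the probes, threshold, and aggregation are all hypothetical placeholders.

```python
import numpy as np

def claim_consistency_report(cached_acts, claim_probes, stated_truth_values):
    """cached_acts: (n_tokens, d_model) activations saved during generation.
    claim_probes: dict mapping each extracted claim to a hypothetical (d_model,)
        probe direction for 'the model represents this claim as true'.
    stated_truth_values: dict mapping each claim to what the output asserted.
    Flags claims where the internal reading and the stated claim disagree."""
    report = {}
    for claim, probe in claim_probes.items():
        internal_score = float(np.max(cached_acts @ probe))  # strongest per-token reading
        internally_true = internal_score > 0.0               # placeholder threshold
        report[claim] = {
            "internal": internally_true,
            "stated": stated_truth_values[claim],
            "mismatch": internally_true != stated_truth_values[claim],
        }
    return report
```

Everything load-bearing here (getting per-claim probes, picking a sensible threshold, deciding how to aggregate over tokens) is the open question, which is why it still seems very TBD to me.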
I think that detecting/preventing models from knowingly lying would be a good research direction and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case when I’m bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false or whatever).
Great point! I can certainly imagine that there could be cases like that, although I can equally imagine that LLMs could be consistently tracking the truth value of claims even if that isn’t a big factor determining the output.
but yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!
That seems reasonable. I’ve mostly had the impression that 1 has generally been true in specific cases where researchers have looked for it, but it’s definitely not something I’ve specifically gone looking for. I’ll be interested to read the sequence from DeepMind.
Also The Chameleon (would have included it in the last comment but had to consult a kid first).
I think that it’s totally possible that there do turn out to be convenient ‘simple correspondences’ for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
Got it. I certainly agree with everything you’re saying in this section of your response. I do think that some of the language in the post suggests that you’re making the stronger claim I was responding to (eg ‘there’s likely a problem you need to solve first’, ‘it is likely an important prerequisite to building a deception detector’, ‘it might even be a category mistake’).
On a slightly different note, here’s another potential critique that I’d be interested in hearing your and/or Eleni’s take on (to be clear, I wouldn’t currently fully endorse all these claims without hedging, but I do think they seem plausible):
Although some intentional states may have complex correspondences with algorithmic states, evidence suggests that the intentional state ‘The model believes fact X’ (eg ‘The model believes that Paris is the capital of France’) has a simple correspondence (modulo some redundancy from training with dropout).
As a result, we’ll likely be able to build a detector for cases where the model believes fact X but outputs ~X.
While that may not be enough to detect the full story of strategic deception, it’s enough to capture all the cases where the model is doing something that can be roughly described as ‘intentionally lying’ (as opposed to error, hallucination, etc).
‘Intentional lying’ is a superset of strategic deception, and we generally don’t want the model to intentionally lie at all, so if we can catch all such cases, then as long as we don’t find any we can be confident that the model isn’t being strategically deceptive.
Where if anywhere would you say that argument is going wrong?
I think this is valuable work, especially the decomposition of capabilities needed for deception, but I’d also like to push back a bit.
I worry about the perfect being the enemy of the good here. There are a number of papers showing that we can at least sometimes use interpretability tools to detect cases where the model believes one thing but says something different. One interesting recent paper (Interpretability Of LLM Deception: Universal Motif) shows that internal evaluation of the actual truth of a statement is handled separately from the decision about whether to lie about it. Of course we can’t be certain at this point that this approach would hold for all cases of deception (especially deep deception), but it’s still potentially useful in practice.
For example, this seems significantly too strong:
it might even be a category mistake to be searching for an algorithmic analog of intentional states.
There are useful representations in the internals of at least some intentional states, eg refusal (as you mention), even if that proves not to be true for all intentional states we care about. Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there’s still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state—certainly at that point it’s no longer practically useful, but any improvement over that extreme has value, even being able to say that the intentional state is implemented in one half of the model rather than the other.
I think you’re entirely right that there’s considerable remaining work before we can provide a universal account connecting all intentional states to algorithmic representations. But I disagree that that work has to be done first; we can make important forward progress on particular intentional states even in the absence of such a general account.
Again, I think the work is valuable. And the critique should be taken seriously, but I think its current version is too strong.
Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Are there particular sources, eg twitter accounts, that you would recommend following? For other readers (I know Daniel already knows this one), the #papers-running-list channel on the AI Alignment Slack is a really good ongoing curation of AIS papers.
One source I’ve recently added and recommend is subscribing to individual authors on Semantic Scholar (eg here’s an author page).
Spyfall is a party game with an interestingly similar mechanic; it might suggest some useful ideas.
Perplexity—is this better than Deep Research for lit reviews?
I periodically try both Perplexity and Elicit, and neither has worked very well for me as yet.
Grok—what do people use this for?
Cases where you really want to avoid left-leaning bias or you want it to generate images that other services flag as inappropriate, I guess?
Otter.ai: Transcribing calls / chats
I’ve found read.ai much better than otter and other services I’ve tried, especially on transcription accuracy, with the caveats that a) I haven’t tried others in a year, and b) read.ai is annoyingly pricy (but does have decent export when/if you decide to ditch it).
What models are you comparing to, though? For o1/o3 you’re just getting a summary, so I’d expect those to be more structured/understandable whether or not the raw reasoning is.
Yeah, well put.
I could never understand their resistance to caring about wild animal suffering, a resistance which seems relatively common.
At a guess, many people have the intuition that we have greater moral responsibility for lives that we brought into being (ie farmed animals)? To me this seems partly reasonable and partly like the Copenhagen interpretation of ethics (which I disagree with).
Tentative pre-coffee thought: it’s often been considered really valuable to be ‘T-shaped’: to have at least shallow knowledge of a broad range of areas (either areas in general, or sub-areas of some particular domain), while simultaneously having very deep knowledge in one area or sub-area. One plausible near-term consequence of LLM-ish AI is that the ‘broad’ part of that becomes less important, because you can count on AI to fill you in on the fly wherever you need it.
Possible counterargument: maybe broad knowledge is just as valuable, although it can be even shallower; if you don’t even know that there’s something relevant to know, that there’s a there there, then you don’t know that it would be useful to get the AI to fill you in on it.
(Much belated comment, but:)
There are two roles that don’t show up in your trip planning example but which I think are important and valuable in AI safety: the Time Buyer and the Trip Canceler.
It’s not at all clear how long it will take Alice to solve the central bottleneck (or for that matter if she’ll be able to solve it at all). The Time Buyer tries to find solutions that may not generalize to the hardest version of the problem but will hold off disaster long enough for the central bottleneck to be solved.
The Trip Canceler tries to convince everyone to cancel the trip so that the fully general solution isn’t needed at all (or at least to delay it long enough for Alice to have plenty of time to work).
They may seem less like the hero of the story, but they’re both playing vital roles.