AI safety & alignment researcher
eggsyntax
Some interesting thoughts on (in)efficient markets from Byrne Hobart, worth considering in the context of Inadequate Equilibria.
(I’ve selected one interesting bit, but there’s more; I recommend reading the whole thing)
When a market anomaly shows up, the worst possible question to ask is “what’s the fastest way for me to exploit this?” Instead, the first thing to do is to steelman it as aggressively as possible, and try to find any way you can to rationalize that such an anomaly would exist. Do stocks rise on Mondays? Well, maybe that means savvy investors have learned through long experience that it’s a good idea to take off risk before the weekend, and even if this approach loses money on average, maybe the one or two Mondays a decade where the market plummets at the open make it a winning strategy because the savvy hedgers are better-positioned to make the right trades within that set.[1] Sometimes, a perceived inefficiency is just measurement error: heavily-shorted stocks reliably underperform the market—until you account for borrow costs (and especially if you account for the fact that if you’re shorting them, there’s a good chance that your shorts will all rally on the same day your longs are underperforming). There’s even meta-efficiency at work in otherwise ridiculous things like gambling on 0DTE options or flipping meme stocks: converting money into fun is a legitimate economic activity, though there are prudent guardrails on it just in case someone finds that getting a steady amount of fun requires burning an excessive number of dollars.
These all flex the notion of efficiency a bit, but it’s important to enumerate them because they illustrate something annoying about the question of market efficiency: the more precisely you specify the definition, and the more carefully you enumerate all of the rational explanations for seemingly irrational activities, the more you’re describing a model of reality so complicated that it’s impossible to say whether it’s 50% or 90% or 1-ε efficient.
Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying “You should see me as less expert in some important areas than you currently do.”
I agree with Daniel here but would add one thing:
what we care about is which one they wear in high-stakes situations where e.g. they have tons of power and autonomy and no one is able to check what they are doing or stop them. (You can perhaps think of this one as the “innermost mask”)
I think there are also valuable questions to be asked about attractors in persona space—what personas does an LLM gravitate to across a wide range of scenarios, and what sorts of personas does it always or never adopt? I’m not aware of much existing research in this direction, but it seems valuable. If, for example, we could demonstrate certain important bounds (‘This LLM will never adopt a mass-murderer persona’), there’s potential alignment value there IMO.
...soon the AI rose and the man died[1]. He went to Heaven. He finally got his chance to discuss this whole situation with God, at which point he exclaimed, “I had faith in you but you didn’t save me, you let me die. I don’t understand why!”
God replied, “I sent you non-agentic LLMs and legible chain of thought, what more did you want?”
and the tokens/activations are all still very local because you’re still early in the forward pass
I don’t understand why this would necessarily be true, since attention heads have access to values for all previous token positions. Certainly, there’s been less computation at each token position in early layers, so I could imagine there being less value to retrieving information from earlier tokens. But on the other hand, I could imagine it sometimes being quite valuable in early layers just to know what tokens had come before.
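To make that concrete, here’s a minimal numpy sketch (arbitrary toy shapes, not any particular model) of causal self-attention: the mask only hides future positions, and it’s the same mask at every layer, so an early-layer head can still read values from any earlier token.

```python
import numpy as np

def causal_self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) residual stream; W_*: (d_model, d_head)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (seq_len, seq_len)
    mask = np.triu(np.full_like(scores, -1e9), 1)   # hide *future* positions only
    weights = np.exp(scores + mask)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visible positions
    return weights @ v                              # each position mixes values from all earlier tokens

# Toy example standing in for an *early* layer: the mask is identical at layer 0
# and at layer 40, so the last position attends over every earlier position here
# too. What differs early on is only how much computation has gone into each
# position's residual stream, not which positions are reachable.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 8, 16, 4
x_early = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
print(causal_self_attention(x_early, W_q, W_k, W_v).shape)  # (8, 4)
```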
For me as an outsider, it still looks like the AI safety movement is only about “how do we prevent AI from killing us?”. I know it’s an oversimplification, but that’s how, I believe, many who don’t really know about AI perceive it.
I don’t think it’s that much of an oversimplification, at least for a lot of AIS folks. Certainly that’s a decent summary of my central view. There are other things I care about—eg not locking in totalitarianism—but they’re pretty secondary to ‘how do we prevent AI from killing us?’. For a while there was an effort in some quarters to rebrand as AINotKillEveryoneism, which I think does a nice job of centering the core issue.
It may as you say be unsexy, but it’s still the thing I care about; I strongly prefer to live, and I strongly prefer for everyone’s children and grandchildren to get to live as well.
We create a small dataset of chat and agentic settings from publicly available benchmarks and datasets.
I believe there are some larger datasets of relatively recent real chat evaluations, eg the LMSYS dataset was most recently updated in July (I’m assuming but haven’t verified that the update added more recent chats).
Can you clarify what you mean by ‘neural analog’ / ‘single neural analog’? Is that meant as another term for what the post calls ‘simple correspondences’?
Even if all the safety-relevant properties have them, there’s no reason to believe (at least for now) that we have the interp tools to find them in time i.e., before having systems fully capable of pulling off a deception plan.
Agreed. I’m hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I’m skeptical that that’ll happen. Or alternately I’m hopeful that we turn out to be in an easy-mode world where there is something like a single ‘deception’ direction that we can monitor, and that’ll at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization pressure / selection pressure if nothing else).
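For illustration, here’s roughly what ‘monitor a single deception direction’ would look like if (big if) such a direction had already been found; the direction, layer, and threshold below are all hypothetical placeholders, and the activations are random stand-ins.

```python
import numpy as np

def flag_suspicious_tokens(resid_acts, deception_dir, threshold=3.0):
    """resid_acts: (n_tokens, d_model) activations cached from the monitored layer.
    deception_dir: (d_model,) unit vector, assumed already found by prior interp work.
    Returns indices of tokens whose projection onto the direction exceeds the
    (made-up) threshold, i.e. tokens a human or another system should inspect."""
    scores = resid_acts @ deception_dir  # per-token projection onto the direction
    return np.where(scores > threshold)[0]

# Toy usage with random stand-ins for real activations and a real direction:
rng = np.random.default_rng(0)
acts = rng.normal(size=(20, 512))
direction = rng.normal(size=512)
direction /= np.linalg.norm(direction)
print(flag_suspicious_tokens(acts, direction))
```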
I’m also worried that claims such as “we can make important forward progress on particular intentional states even in the absence of such a general account.” could further lead to a slippery slope that more or less embraces having the dangerous thing first without sufficient precautions
I agree that that’s a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation wasn’t necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.
I think premise 1 is big if true, but I think I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence.
I haven’t read that sequence; I’ll check it out, thanks. I’m thinking of work like the ROME paper from David Bau’s lab, which suggests that fact storage can be identified and edited, and various papers like this one from Mor Geva+ that find evidence that the MLP layers in LLMs are largely key-value stores.
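To spell out that key-value framing in toy form (this is my restatement of the idea, not code from either paper): each row of the MLP’s input matrix acts as a ‘key’ matched against the residual stream, and the corresponding column of the output matrix is a ‘value’ written back, weighted by how strongly that key fired. A minimal numpy sketch, with all shapes illustrative:

```python
import numpy as np

def mlp_as_kv_store(x, W_in, W_out):
    """x: (d_model,) residual stream; W_in: (d_mlp, d_model); W_out: (d_model, d_mlp).
    Equivalent to the usual W_out @ relu(W_in @ x), just written to emphasize the
    key-value reading: key_strengths says which stored entries fire, and the
    output is a weighted sum of the corresponding 'value' vectors."""
    key_strengths = np.maximum(W_in @ x, 0.0)  # how strongly each 'key' row matches x
    values = W_out.T                           # (d_mlp, d_model): one 'value' per key
    return key_strengths @ values

rng = np.random.default_rng(0)
d_model, d_mlp = 16, 64
x = rng.normal(size=d_model)
W_in = rng.normal(size=(d_mlp, d_model))
W_out = rng.normal(size=(d_model, d_mlp))
out = mlp_as_kv_store(x, W_in, W_out)
# Which stored entries contributed most to this output:
print(np.argsort(np.maximum(W_in @ x, 0.0))[-3:])
```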
Relatedly, your second bullet point assumes that you can identify the ‘fact’ related to what the model is currently outputting unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
It does seem like a naive approach would require pre-identifying all facts you wanted to track. On the other hand, I can imagine an approach like analyzing the output for factual claims and then searching for those in the record of activations during the output. Not sure, seems very TBD.
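Very roughly, the on-the-fly version might look something like the sketch below, assuming (and this is exactly the unproven part) that we could get a probe direction for each claim extracted from the output; the probes, threshold, and aggregation are all hypothetical placeholders.

```python
import numpy as np

def claim_consistency_report(cached_acts, claim_probes, stated_truth_values):
    """cached_acts: (n_tokens, d_model) activations saved during generation.
    claim_probes: dict mapping each extracted claim to a hypothetical (d_model,)
        probe direction for 'the model represents this claim as true'.
    stated_truth_values: dict mapping each claim to what the output asserted.
    Flags claims where the internal reading and the stated claim disagree."""
    report = {}
    for claim, probe in claim_probes.items():
        internal_score = float(np.max(cached_acts @ probe))  # strongest per-token reading
        internally_true = internal_score > 0.0               # placeholder threshold
        report[claim] = {
            "internal": internally_true,
            "stated": stated_truth_values[claim],
            "mismatch": internally_true != stated_truth_values[claim],
        }
    return report
```

Everything load-bearing here (getting per-claim probes, picking a sensible threshold, deciding how to aggregate over tokens) is the open question, which is why it still seems very TBD to me.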
I think that detecting/preventing models from knowingly lying would be a good research direction and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case when I’m bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false or whatever).
Great point! I can certainly imagine that there could be cases like that, although I can equally imagine that LLMs could be consistently tracking the truth value of claims even if that isn’t a big factor determining the output.
but yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!
That seems reasonable. I’ve mostly had the impression that 1 has generally been true in specific cases where researchers have looked for it, but it’s definitely not something I’ve specifically gone looking for. I’ll be interested to read the sequence from DeepMind.
Also The Chameleon (would have included it in the last comment but had to consult a kid first).
I think that it’s totally possible that there do turn out to be convenient ‘simple correspondences’ for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it’s important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
Got it. I certainly agree with everything you’re saying in this section of your response. I do think that some of the language in the post suggests that you’re making the stronger claim I was responding to (eg ‘there’s likely a problem you need to solve first’, ‘it is likely an important prerequisite to building a deception detector’, ‘it might even be a category mistake’).
On a slightly different note, here’s another potential critique that I’d be interested in hearing your and/or Eleni’s take on (to be clear, I wouldn’t currently fully endorse all these claims without hedging, but I do think they seem plausible):
Although some intentional states may have complex correspondences with algorithmic states, evidence suggests that the intentional state ‘The model believes fact X’ (eg ‘The model believes that Paris is the capital of France’) has a simple correspondence (modulo some redundancy from training with dropout).
As a result, we’ll likely be able to build a detector for cases where the model believes fact X but outputs ~X.
While that may not be enough to detect the full story of strategic deception, it’s enough to capture all the cases where the model is doing something that can be roughly described as ‘intentionally lying’ (as opposed to error, hallucination, etc).
‘Intentional lying’ is a superset of strategic deception, and we generally don’t want the model to intentionally lie at all, so if we can catch all such cases, then as long as we don’t find any we can be confident that the model isn’t being strategically deceptive.
Where if anywhere would you say that argument is going wrong?
I think this is valuable work, especially the decomposition of capabilities needed for deception, but I’d also like to push back a bit.
I worry about the perfect being the enemy of the good here. There are a number of papers showing that we can at least sometimes use interpretability tools to detect cases where the model believes one thing but says something different. One interesting recent paper (Interpretability Of LLM Deception: Universal Motif) shows that internal evaluation of the actual truth of a statement is handled separately from the decision about whether to lie about it. Of course we can’t be certain at this point that this approach would hold for all cases of deception (especially deep deception), but it’s still potentially useful in practice.
For example, this seems significantly too strong:
it might even be a category mistake to be searching for an algorithmic analog of intentional states.
There are useful representations in the internals of at least some intentional states, eg refusal (as you mention), even if that proves not to be true for all intentional states we care about. Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there’s still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state—certainly at that point it’s no longer practically useful, but any improvement over that extreme has value, even being able to say that the intentional state is implemented in one half of the model rather than the other.
I think you’re entirely right that there’s considerable remaining work before we can provide a universal account connecting all intentional states to algorithmic representations. But I disagree that that work has to be done first; we can make important forward progress on particular intentional states even in the absence of such a general account.
Again, I think the work is valuable. And the critique should be taken seriously, but I think its current version is too strong.
Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Are there particular sources, eg twitter accounts, that you would recommend following? For other readers (I know Daniel already knows this one), the #papers-running-list channel on the AI Alignment Slack is a really good ongoing curation of AIS papers.
One source I’ve recently added and recommend is subscribing to individual authors on Semantic Scholar (eg here’s an author page).
Spyfall is a party game with an interestingly similar mechanic; it might suggest some useful ideas.
Perplexity—is this better than Deep Research for lit reviews?
I periodically try both Perplexity and Elicit, and neither has worked very well for me as yet.
Grok—what do people use this for?
Cases where you really want to avoid left-leaning bias or you want it to generate images that other services flag as inappropriate, I guess?
Otter.ai: Transcribing calls / chats
I’ve found read.ai much better than otter and other services I’ve tried, especially on transcription accuracy, with the caveats that a) I haven’t tried others in a year, and b) read.ai is annoyingly pricy (but does have decent export when/if you decide to ditch it).
What models are you comparing to, though? For o1/o3 you’re just getting a summary, so I’d expect those to be more structured/understandable whether or not the raw reasoning is.
Yeah, well put.
I could never understand their resistance to caring about wild animal suffering, a resistance which seems relatively common.
At a guess, many people have the intuition that we have greater moral responsibility for lives that we brought into being (ie farmed animals)? To me this seems partly reasonable and partly like the Copenhagen interpretation of ethics (which I disagree with).
Tentative pre-coffee thought: it’s often been considered really valuable to be ‘T-shaped’: to have at least shallow knowledge of a broad range of areas (either areas in general, or sub-areas of some particular domain), while simultaneously having very deep knowledge in one area or sub-area. One plausible near-term consequence of LLM-ish AI is that the ‘broad’ part of that becomes less important, because you can count on AI to fill you in on the fly wherever you need it.
Possible counterargument: maybe broad knowledge is just as valuable, although it can be even shallower; if you don’t even know that there’s something relevant to know, that there’s a there there, then you don’t know that it would be useful to get the AI to fill you in on it.
(Much belated comment, but:)
There are two roles that don’t show up in your trip planning example but which I think are important and valuable in AI safety: the Time Buyer and the Trip Canceler.
It’s not at all clear how long it will take Alice to solve the central bottleneck (or for that matter if she’ll be able to solve it at all). The Time Buyer tries to find solutions that may not generalize to the hardest version of the problem but will hold off disaster long enough for the central bottleneck to be solved.
The Trip Canceler tries to convince everyone to cancel the trip so that the fully general solution isn’t needed at all (or at least to delay it long enough for Alice to have plenty of time to work).
They may seem less like the hero of the story, but they’re both playing vital roles.