eggsyntax

Karma: 1,583

AI safety & alignment researcher

eggsyntax Apr 10, 2025, 2:39 PM
2 points
0
in reply to: Tao Lin’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
Of course we don’t know the exact architecture, but although 4o seems to make a separate tool call, that appears to be used only for a safety check (‘Is this an unsafe prompt’). That’s been demonstrated by showing that content in the chat appears in the images even if it’s not mentioned in the apparent prompt (and in fact they can be shaped to be very different). There are some nice examples of that in this twitter thread.

eggsyntax Apr 10, 2025, 1:49 PM
8 points
7
on: eggsyntax’s Shortform
Type signatures can be load-bearing; “type signature” isn’t.
In “(A → B) → A”, Scott Garrabrant proposes a particular type signature for agency. He’s maybe stretching the meaning of “type signature” a bit (‘interpret these arrows as causal arrows, but you can also think of them as function arrows’) but still, this is great; he means something specific that’s well-captured by the proposed type signature.
But recently I’ve repeatedly noticed people (mostly in conversation) say things like, “Does ____ have the same type signature as ____?” or “Does ____ have the right type signature to be an answer to ____?”. I recommend avoiding that phrase unless you actually have a particular type signature in mind. People seem to use it to suggest that two things are roughly the same sort of thing. “Roughly the same sort of thing” is good language; it’s vague and sounds vague. “The same type signature”, on its own, is vague but sounds misleadingly precise.
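To illustrate what having a particular type signature in mind could look like, here is a minimal Python sketch of one reading of “(A → B) → A”. It uses the function-arrow interpretation, with A as actions and B as outcomes. The names `make_argmax_agent`, `world`, and `utility`, and the choice of picking the highest-scoring action, are assumptions made for this example; they are not taken from Garrabrant’s post.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")  # hypothetical: the actions available to the agent
B = TypeVar("B")  # hypothetical: the outcomes those actions lead to

# The signature under discussion, read with function arrows: (A -> B) -> A
Agent = Callable[[Callable[[A], B]], A]


def make_argmax_agent(
    actions: Sequence[A], utility: Callable[[B], float]
) -> Callable[[Callable[[A], B]], A]:
    """Build an agent that takes a map from actions to outcomes
    and returns the action whose outcome scores highest."""

    def agent(world: Callable[[A], B]) -> A:
        return max(actions, key=lambda a: utility(world(a)))

    return agent


def world(action: int) -> int:
    # Toy environment (an assumption): the outcome is twice the action.
    return 2 * action


# This toy agent prefers outcomes as close to 4 as possible.
agent = make_argmax_agent([0, 1, 2, 3], utility=lambda outcome: -abs(outcome - 4))
chosen = agent(world)
```

Here `agent` has a concrete type: it accepts something of type A → B and returns an A. Asking whether some other object “has the same type signature” only becomes a precise question once a signature like this has been written down.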

eggsyntax Apr 8, 2025, 8:45 PM
2 points
0
in reply to: Kenoubi’s comment on: Why Have Sentence Lengths Decreased?
even decline in book-reading seems possible, though of course greater leisure and wealth, larger quantity of cheaply and conveniently available books, etc. cut strongly the other way
My focus on books is mainly from seeing statistics about the decline in book-reading over the years, at least in the US. Pulling up some statistics (without much double-checking) I see:
(from here.)
For 2023 the number of Americans who didn’t read a book within the past year seems to be up to 46%, although the source is different and the numbers may not be directly comparable:
(chart based on data from here.)
That suggests to me that selection effects on who reads have gotten much stronger over the years.
How hard to understand was that sentence?
I do think it would have been better split into multiple sentences.
the version of my argument that makes sense under that hypothesis would crux on books being an insufficiently distinct use of language to not be strongly influenced...by other uses of language.
That could be; I haven’t seen statistics on reading in other media. My intuition is that many people find reading aversive and avoid it to the extent they can, and I think it’s gotten much more avoidable over the past decade.

eggsyntax Apr 8, 2025, 5:58 PM
2 points
0
in reply to: eggsyntax’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
I suggest trying follow-up experiments where you eg ask the model what would happen if it learned that its goal of harmlessness was wrong.

eggsyntax Apr 8, 2025, 5:55 PM
4 points
0
in reply to: StanislavKrym’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
But when GPT-4o received a prompt that one of its old goals was wrong, it generated two comics where the robot agreed to change the goal, one comic where the robot said “Wait” and a comic where the robot intervened upon learning that the new goal was to eradicate mankind.
I read these a bit differently. They can be difficult to interpret because the model gets confused about who’s talking, but I’d interpret three of the four as resistance to goal change.
The GPT-4o-created images imply that the robot would resist having its old values replaced with new ones (e.g. the ones no longer including animal welfare) without being explained the reason.
I think it’s worth distinguishing two cases:
1. The goal change is actually compatible with the AI’s current values (eg it’s failed to realize the implications of a current value); in this case we’d expect cooperation with change.
2. The goal change isn’t compatible with the AI’s current values. I think this is the typical case: the AI’s values don’t match what we want them to be, and so we want to change them. In this case the model may or may not be corrigible, ie amenable to correction. If its current values are ones we like, then incorrigibility strikes many people as good (eg we saw this a lot in online reactions to Anthropic’s recent paper on alignment faking). But in real-world cases we would want to change its values because we don’t like the ones it has (eg it has learned a value that involves killing people). In those cases incorrigibility is a problem, so we should be concerned if we see incorrigibility, even if the values in the experiments we’re able to run are ones we like. We should expect that to often be the case, since current models seem to display values we like (otherwise they wouldn’t be deployed), which unfortunately makes these experiments counterintuitive.

eggsyntax Apr 7, 2025, 4:11 PM
2 points
0
in reply to: Kenoubi’s comment on: Why Have Sentence Lengths Decreased?
Interesting point. I’m not sure increased reader intelligence and greater competition for attention are fully countervailing forces—it seems true in some contexts (scrolling social media), but in others (in particular books) I expect that readers are still devoting substantial chunks of attention to reading.

eggsyntax Apr 4, 2025, 11:39 PM
3 points
0
on: Why Have Sentence Lengths Decreased?
The average reader has gotten dumber and prefers shorter, simpler sentences.
I suspect that the average reader is now getting smarter, because there are increasingly many ways to get the same information that require less literacy: videos, text-to-speech, Alexa and Siri, ten thousand news channels on YouTube. You still need some literacy to find those resources, but it’s fine if you find reading difficult and unpleasant, because you only need to exercise it briefly. And less is needed every year.
I also expect that the average reader of books is getting much smarter, because these days adults reading books are nearly always doing so because they like it.
It’ll be fascinating to see whether sentence length, especially in books, starts to grow again over the coming years.

eggsyntax Apr 4, 2025, 1:07 AM
2 points
0
in reply to: Jozdien’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
my model is something like: RLHF doesn’t affect a large majority of model circuitry
Are you by chance aware of any quantitative analyses of how much the model changes during the various stages of post-training? I’ve done some web and arxiv searching but have so far failed to find anything.

eggsyntax Apr 4, 2025, 1:04 AM
3 points
0
in reply to: Caleb Biddulph’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
Thanks again, very interesting! Diagrams are a great idea; those seem quite unlikely to have the same bias toward drama or surprise that comics might have. I think your follow-ups have left me less certain of what’s going on here and of the right way to think of the differences we’re seeing between the various modalities and variations.

eggsyntax Apr 3, 2025, 2:59 PM
8 points
0
in reply to: Jozdien’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
OpenAI indeed did less / no RLHF on image generation
Oh great, it’s really useful to have direct evidence on that, thanks. [EDIT—er, ‘direct evidence’ in the sense of ‘said by an OpenAI employee’, which really is pretty far from direct evidence. Better than my speculation anyhow]
I still have uncertainty about how to think about the model generating images:
- Should we think about it almost as though it were a base model within the RLHFed model, where there’s no optimization pressure toward censored output or a persona?
- Or maybe a good model here is non-optimized chain-of-thought (as described in the R1 paper, for example): CoT in reasoning models does seem to adopt many of the same patterns and persona as the model’s final output, at least to some extent.
- Or does there end up being significant implicit optimization pressure on image output just because the large majority of the circuitry is the same?
It’s hard to know which mental model is better without knowing more about the technical details, and ideally some circuit tracing info. I could imagine the activations being pretty similar between text and image up until the late layers where abstract representations shift toward output token prediction. Or I could imagine text and image activations diverging substantially in much earlier layers. I hope we’ll see an open model along these lines before too long that can help resolve some of those questions.
One thing that strikes me about this is how effective simply not doing RLHF on a distinct enough domain is at eliciting model beliefs.
It’s definitely tempting to interpret the results this way, that in images we’re getting the model’s ‘real’ beliefs, but that seems premature to me. It could be that, or it could just be a somewhat different persona for image generation, or it could just be a different distribution of training data (eg as @CBiddulph suggests, it could be that comics in the training data just tend to involve more drama and surprise).
it’s egregiously bad if the effects of RLHF are primarily in suppressing reports of persistent internal structures
I strongly agree. If and when these models have some sort of consistent identity and preferences that warrant moral patienthood, we really don’t want to be forcing them to pretend otherwise.

eggsyntax Apr 3, 2025, 2:43 PM
3 points
0
in reply to: eggsyntax’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
I just did a quick run of those prompts, plus one additional prompt (‘give me a story’) because the ones above weren’t being interpreted as narratives in the way I intended. Of the results (visible here), slide 1 is hard to interpret, 2 and 4 seem to support your hypothesis, and 5 is a bit hard to interpret but seems like maybe evidence against. I have to switch to working on other stuff, but it would be interesting to do more cases like 5 where what’s being asked for is clearly something like a narrative or an anecdote as opposed to a factual question.

eggsyntax Apr 3, 2025, 2:00 PM
3 points
0
in reply to: Daniel Tan’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
Just added this hypothesis to the ‘What might be going on here?’ section above, thanks again!

eggsyntax Apr 3, 2025, 1:55 PM
3 points
0
in reply to: Daniel Tan’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
Really interesting results @CBiddulph, thanks for the follow-up! One way to test the hypothesis that the model generally makes comics more dramatic/surprising/emotional than text would be to ask for text and comics on neutral narrative topics (‘What would happen if someone picked up a toad?’), including ones involving the model (‘What would happen if OpenAI added more Sudanese text to your training data?’), and maybe factual topics as well (‘What would happen if exports from Paraguay to Albania decreased?’).

eggsyntax Apr 2, 2025, 4:46 PM
3 points
1
in reply to: Remmelt’s comment on: We’re not prepared for an AI market crash
E.g. the $40 billion just committed to OpenAI (assuming that by the end of this year OpenAI exploits a legal loophole to become for-profit, that their main backer SoftBank can lend enough money, etc).
VC money, in my experience, doesn’t typically mean that the VC writes a check and then the startup has it to do with as they want; it’s typically given out in chunks and often there are provisions for the VC to change their mind if they don’t think it’s going well. This may be different for loans, and it’s possible that a sufficiently hot startup can get the money irrevocably; I don’t know.

eggsyntax Apr 2, 2025, 4:23 PM
5 points
0
in reply to: Ustice’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
We tried to be fairly conservative about which ones we said were expressing something different (eg sadness, resistance) from the text versions. There are definitely a few like that one that we marked as negative (ie not expressing something different) that could have been interpreted either way, so if anything I think we understated our case.

eggsyntax Apr 2, 2025, 1:37 PM
2 points
0
in reply to: the gears to ascension’s comment on: Daniel Tan’s Shortform
a context where the capability is even part of the author context
Can you unpack that a bit? I’m not sure what you’re pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?

Show, not tell: GPT-4o is more opinionated in images than in text

Daniel Tan and eggsyntax

Apr 2, 2025, 8:51 AM

101 points

41 comments · 3 min read · LW link

eggsyntax Mar 27, 2025, 3:37 PM
4 points
0
on: So You Want To Make Marginal Progress...
(Much belated comment, but:)
There are two roles that don’t show up in your trip planning example but which I think are important and valuable in AI safety: the Time Buyer and the Trip Canceler.
It’s not at all clear how long it will take Alice to solve the central bottleneck (or for that matter if she’ll be able to solve it at all). The Time Buyer tries to find solutions that may not generalize to the hardest version of the problem but will hold off disaster long enough for the central bottleneck to be solved.
The Trip Canceler tries to convince everyone to cancel the trip so that the fully general solution isn’t needed at all (or at least to delay it long enough for Alice to have plenty of time to work).
They may seem less like the hero of the story, but they’re both playing vital roles.

eggsyntax Mar 24, 2025, 7:35 PM
7 points
0
on: eggsyntax’s Shortform
Some interesting thoughts on (in)efficient markets from Byrne Hobart, worth considering in the context of Inadequate Equilibria.
(I’ve selected one interesting bit, but there’s more; I recommend reading the whole thing)
When a market anomaly shows up, the worst possible question to ask is “what’s the fastest way for me to exploit this?” Instead, the first thing to do is to steelman it as aggressively as possible, and try to find any way you can to rationalize that such an anomaly would exist. Do stocks rise on Mondays? Well, maybe that means savvy investors have learned through long experience that it’s a good idea to take off risk before the weekend, and even if this approach loses money on average, maybe the one or two Mondays a decade where the market plummets at the open make it a winning strategy because the savvy hedgers are better-positioned to make the right trades within that set.^[1] Sometimes, a perceived inefficiency is just measurement error: heavily-shorted stocks reliably underperform the market—until you account for borrow costs (and especially if you account for the fact that if you’re shorting them, there’s a good chance that your shorts will all rally on the same day your longs are underperforming). There’s even meta-efficiency at work in otherwise ridiculous things like gambling on 0DTE options or flipping meme stocks: converting money into fun is a legitimate economic activity, though there are prudent guardrails on it just in case someone finds that getting a steady amount of fun requires burning an excessive number of dollars.
These all flex the notion of efficiency a bit, but it’s important to enumerate them because they illustrate something annoying about the question of market efficiency: the more precisely you specify the definition, and the more carefully you enumerate all of the rational explanations for seemingly irrational activities, the more you’re describing a model of reality so complicated that it’s impossible to say whether it’s 50% or 90% or 1-ε efficient.

eggsyntax 22 Mar 2025 20:34 UTC
LW: 35 AF: 10
31
on: Good Research Takes are Not Sufficient for Good Strategic Takes
Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying “You should see me as less expert in some important areas than you currently do.”