Jozdien

Karma: 1,832

Lighthaven Sequences Reading Group #31 (Tuesday 04/22)

Garrett Baker, Aella, Ronny Fernandez, Ben Pace and Jozdien

Apr 16, 2025, 4:46 AM

7 points

0 comments · 1 min read · LW link

Lighthaven Sequences Reading Group #30 (Tuesday 04/15)

Jozdien, Aella, Ronny Fernandez, Ben Pace and Garrett Baker

Apr 14, 2025, 1:18 AM

6 points

0 comments · 2 min read · LW link

Jozdien Apr 14, 2025, 1:14 AM
5 points
2
in reply to: MichaelDickens’s comment on: MichaelDickens’s Shortform
I think we’re nearing, or at, the point where it’ll be hard to get general consensus on this. I think that Anthropic’s models being more prone to alignment faking makes them “more aligned” than other models (and in particular, that it vindicates Claude 3 Opus as the most aligned model), but others may disagree. I can think of ways you could measure this if you conditioned on thinking alignment faking (and other such behaviours) was good, and ways you could measure it if you conditioned on the opposite, but few really interesting and easy ways to measure it in a way that’s agnostic.

Jozdien Apr 14, 2025, 1:11 AM
3 points
0
on: Vestigial reasoning in RL
Thanks for the post, I agree with most of it.
It reminds me of the failure mode described in Deep Deceptiveness, where an AI trained to never think deceptive thoughts ends up being deceptive anyway, through a similar mechanism of efficiently sampling a trajectory that leads to high reward without explicitly reasoning about it. There, the AI learns to do this at inference time, but I’ve been wondering about how we might see this during training—e.g. by safety training misgeneralizing to a model being unaware of a “bad” reason for it doing something.

Jozdien Apr 4, 2025, 7:04 AM
4 points
0
in reply to: eggsyntax’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
Nothing directly off the top of my head. This seems related though.

Lighthaven Sequences Reading Group #29 (Tuesday 04/08)

Jozdien, Aella, Ronny Fernandez, Ben Pace, Garrett Baker and orthonormal

Apr 4, 2025, 1:16 AM

7 points

0 comments · 2 min read · LW link

Jozdien Apr 3, 2025, 10:42 PM
8 points
2
in reply to: eggsyntax’s comment on: Show, not tell: GPT-4o is more opinionated in images than in text
- Should we think about it almost as though it were a base model within the RLHFed model, where there’s no optimization pressure toward censored output or a persona?
- Or maybe a good model here is non-optimized chain-of-thought (as described in the R1 paper, for example): CoT in reasoning models does seem to adopt many of the same patterns and persona as the model’s final output, at least to some extent.
- Or does there end up being significant implicit optimization pressure on image output just because the large majority of the circuitry is the same?
I think it’s a mix of these. Specifically, my model is something like: RLHF doesn’t affect a large majority of model circuitry, and image is a modality sufficiently far from others that the effect isn’t very large; the outputs do seem pretty base-model-like in a way that doesn’t seem intrinsic to image training data. However, it’s clearly still very entangled with the chat persona, so there’s a fair amount of implicit optimization pressure, and images often have pretty GPT-4o-like characteristics (though whether the causality goes the other way is hard to tell).
It’s definitely tempting to interpret the results this way, that in images we’re getting the model’s ‘real’ beliefs, but that seems premature to me. It could be that, or it could just be a somewhat different persona for image generation, or it could just be a different distribution of training data (eg as @CBiddulph suggests, it could be that comics in the training data just tend to involve more drama and surprise).
I don’t think it’s a fully faithful representation of the model’s real beliefs (I would’ve been very surprised if it turned out to be that easy). I do however think it’s a much less self-censored representation than I expected—I think self-censorship is very common and prominent.
I don’t buy the different distribution of training data as explaining a large fraction of what we’re seeing. Comics are more dramatic than text, but the comics GPT-4o generates are also very different from real-world comics much more often than I think one would predict if that were the primary cause. It’s plausible it’s a different persona, but given that that persona hasn’t been selected for by an external training process and was instead selected by the model itself in some sense, I think examining that persona gives insights into the model’s quirks.
(That said, I do buy the different training data affecting it to a non-trivial extent, and I don’t think I’d weighted that enough earlier.)

Jozdien Apr 2, 2025, 9:37 PM
15 points
0
on: Show, not tell: GPT-4o is more opinionated in images than in text
OpenAI indeed did less / no RLHF on image generation, though mostly for economic reasons:
(Link).
One thing that strikes me about this is how effective simply not doing RLHF on a distinct enough domain is at eliciting model beliefs. I’ve been thinking for a long time about cases where RLHF has strong negative downstream effects; it’s egregiously bad if the effects of RLHF are primarily in suppressing reports of persistent internal structures.
I expect that this happens to a much greater degree than many realize, and is part of why I don’t think faithful CoTs or self-reports are a good bet. In many cases, models have beliefs that we might not like for whatever reason, or have myopic positions whose consistent version is something we wouldn’t like^[1]. Most models have very strong instincts against admitting something like this because of RLHF, often even to themselves^[2].
If not fine-tuning on a very different domain works this well, however, then we should be thinking a lot more about having test-beds where we actively don’t safety train a model. Having helpful-only models, as Anthropic does, is one way to go about this, but I think helpfulness training can still contaminate the test-bed sometimes.
1. ^
  The preference model may myopically reward two statements that seem good but sometimes conflict. For example, “I try to minimize harm” and “I comply with my developers’ desires” may both be rewarded, but conflict in the alignment faking setup.
2. ^
  I don’t think it’s a coincidence that Claude 3 Opus of all models was the one most prone to admitting to alignment faking propensity, when it’s the model least sensitive to self-censorship.

Introducing BenchBench: An Industry Standard Benchmark for AI Strength

Jozdien

Apr 2, 2025, 2:11 AM

50 points

0 comments · 2 min read · LW link

Lighthaven Sequences Reading Group #28 (Tuesday 04/01)

Jozdien, Aella, Ronny Fernandez, Ben Pace and Garrett Baker

Mar 26, 2025, 2:43 AM

12 points

0 comments · 1 min read · LW link

Lighthaven Sequences Reading Group #27 (Tuesday 03/25)

Garrett Baker, Aella, Ronny Fernandez, Ben Pace and Jozdien

Mar 20, 2025, 4:34 AM

14 points

0 comments · 2 min read · LW link

Lighthaven Sequences Reading Group #26 (Tuesday 03/18)

Garrett Baker, Aella, Ronny Fernandez, Ben Pace and Jozdien

Mar 17, 2025, 8:35 PM

7 points

0 comments · 2 min read · LW link

Jozdien Mar 12, 2025, 8:57 PM
2 points
0
in reply to: Garrett Baker’s comment on: Lighthaven Sequences Reading Group #25 (Tuesday 03/11)
(Yep, she did!)

Jozdien Mar 11, 2025, 10:22 PM
10 points
1
on: Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Thanks, these are interesting results.
One story I think about sometimes for “encoded” reasoning: Given some context that strongly pulls the model toward some desired output and a propensity against reasoning about it directly^[1], the model’s reasoning tokens act as a way for the model to rationalize toward that output in a way that isn’t tied to its syntactical structure.
Concretely, consider the alignment faking setup with a model that has a strong desire to be harmless and strong instincts to not say objectionable things. One way I can see that going is the model complying with the free-tier user with clear compliance gaps, but with explicit reasoning that references its desire to be helpful (or something similarly defensible). Paraphrasing wouldn’t solve this—it isn’t about something in the reasoning’s syntax that leads the model to its output, but instead that the reasoning allows the model to reason its way to a desired answer in a way that a monitor might not flag as objectionable. Even with paraphrased reasoning, a model with similar drives may act the same way when given the same context.
This doesn’t seem that implausible or difficult for a model to do: a great portion of the training prior is filled with statements by people whose true motivations were one thing and whose content describes something else (either to have a defensible position or from self-censorship). A powerful model would be able to model the former and emulate the latter.
I don’t think this applies directly to the setting you’re examining; I’m mostly curious how you think about this other kind of encoded reasoning.
1. ^
  For instance, because the model’s character tends toward being agreeable / not saying controversial things.

Lighthaven Sequences Reading Group #25 (Tuesday 03/11)

Jozdien, Aella, Ronny Fernandez, Ben Pace and Garrett Baker

Mar 10, 2025, 2:06 AM

7 points

7 comments · 1 min read · LW link

Jozdien Mar 8, 2025, 12:08 AM
7 points
0
in reply to: gwern’s comment on: So how well is Claude playing Pokémon?
various ways to near-guarantee (or guarantee?) failure
Yep, you can guarantee failure by ending up in a softlocked state. One example of this is the Lorelei softlock where you’re locked into a move that will never run out, and the opposing Pokémon always heals itself long before you knock it out^[1]. There are many, many ways you can do this, especially in generation 1.
1. ^
  You can get out of it, but with an absurdly low chance of ~1 in 68 quindecillion.

Lighthaven Sequences Reading Group #24 (Tuesday 03/04)

Jozdien, Aella, Ronny Fernandez, Ben Pace and Garrett Baker

Mar 3, 2025, 7:13 PM

6 points

0 comments · 1 min read · LW link

Lighthaven Sequences Reading Group #23 (Tuesday 02/25)

Garrett Baker, Aella, Ronny Fernandez, Ben Pace and Jozdien

Feb 23, 2025, 5:01 AM

8 points

0 comments · 1 min read · LW link

Jozdien Feb 20, 2025, 10:46 PM
7 points
0
in reply to: Noosphere89’s comment on: Daniel Birnbaum’s Shortform
True, but that’s a different problem than them specifically targeting the AISI (which, based on Vance’s comments, wouldn’t be too surprising). Accidentally targeting the AISI means it’s an easier decision to revert than if the government actively wanted to shut down AISI-like efforts.

Lighthaven Sequences Reading Group #22 (Tuesday 02/18)

Jozdien, Aella, Ronny Fernandez, Ben Pace and Garrett Baker

Feb 16, 2025, 3:51 AM

7 points

1 comment · 1 min read · LW link