(Yep, she did!)
Jozdien
Introducing BenchBench: An Industry Standard Benchmark for AI Strength
Lighthaven Sequences Reading Group #28 (Tuesday 04/01)
Lighthaven Sequences Reading Group #27 (Tuesday 03/25)
Lighthaven Sequences Reading Group #26 (Tuesday 03/18)
Thanks, these are interesting results.
One story I think about sometimes for “encoded” reasoning: given some context that strongly pulls the model toward a desired output, plus a propensity against reasoning about that output directly[1], the model’s reasoning tokens act as a way for the model to rationalize toward that output without the rationalization being tied to the reasoning’s syntactic structure.
Concretely, consider the alignment faking setup with a model that has a strong desire to be harmless and strong instincts not to say objectionable things. One way I can see that going is the model complying with free-tier users (showing a clear compliance gap), but with explicit reasoning that references its desire to be helpful (or something similarly defensible). Paraphrasing wouldn’t solve this: the problem isn’t that something in the reasoning’s syntax leads the model to its output, but that the reasoning lets the model argue its way to a desired answer in a way a monitor might not flag as objectionable. Even with paraphrased reasoning, a model with similar drives may act the same way when given the same context.
This doesn’t seem that implausible or difficult for a model to do: a great portion of the training prior is filled with statements by people whose true motivations were one thing and whose content describes something else (either to maintain a defensible position or from self-censorship). A powerful model would be able to model the former and emulate the latter.
I don’t think this applies directly to the setting you’re examining; I’m mostly curious how you think about this other kind of encoded reasoning.
[1] For instance, because the model’s character tends toward being agreeable / not saying controversial things.
Lighthaven Sequences Reading Group #25 (Tuesday 03/11)
various ways to near-guarantee (or guarantee?) failure
Yep, you can guarantee failure by ending up in a softlocked state. One example of this is the Lorelei softlock where you’re locked into a move that will never run out, and the opposing Pokemon always heals itself long before you knock it out[1]. There are many, many ways you can do this, especially in generation 1.
[1] You can get out of it, but with an absurdly low chance of ~1 in 68 quindecillion.
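To give a rough sense of how an escape chance that small can arise from ordinary game mechanics, here is a minimal compound-probability sketch; the per-turn fluke chance and the number of consecutive flukes below are hypothetical placeholders for illustration, not the actual mechanics behind the quoted figure.

```python
# Hypothetical illustration: if escaping required a chain of independent
# low-probability events (one per turn), the overall odds shrink
# geometrically with the length of the chain.
per_event_probability = 1 / 256   # hypothetical per-turn fluke chance
events_needed = 21                # hypothetical number of consecutive flukes

escape_probability = per_event_probability ** events_needed
print(f"Overall escape chance: ~1 in {1 / escape_probability:.3g}")
# -> ~1 in 3.74e+50, the same astronomically small ballpark as the
#    ~1 in 68 quindecillion (~6.8e49) figure quoted above.
```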
Lighthaven Sequences Reading Group #24 (Tuesday 03/04)
Lighthaven Sequences Reading Group #23 (Tuesday 02/25)
True, but that’s a different problem than them specifically targeting the AISI (which, based on Vance’s comments, wouldn’t be too surprising). Accidentally targeting the AISI means it’s an easier decision to revert than if the government actively wanted to shut down AISI-like efforts.
Lighthaven Sequences Reading Group #22 (Tuesday 02/18)
I agree that an ASI whose only goal is increasing its own capability would probably out-compete others, all else equal. However, that’s both the kind of thing that doesn’t need to happen (I expect most AIs wouldn’t self-modify that much, so it comes down to how likely such goals are to arise naturally), and the kind of thing that other AIs are incentivized to cooperate to prevent. Every AI that doesn’t have that goal would have a reason to cooperate to prevent AIs like that from simply winning.
I don’t think there’s an intrinsic reason why expansion would be incompatible with human flourishing. AIs that care about human flourishing could outcompete the others (if they start out with any advantage). The upside of goals being orthogonal to capability is that good goals don’t suffer for being good.
Lighthaven Sequences Reading Group #21 (Tuesday 02/11)
I would be curious whether you consider The Gentle Seduction to be optimistic. I think it has fewer elements that you mentioned finding dystopian in another comment, but I find the two trajectories similarly good.
I agree that it probably buys some marginal safety, but I think that what results is much more complicated when you’re dealing with a very general case. E.g. this gwern comment. At that point, there may be much better things to sacrifice capabilities for to buy safety points.
I believe this is the tweet.
I would ask what the end-goal of interpretability is. Specifically, what explanations of our model’s cognition do we want to get out of our interpretability methods? The mapping we want is from the model’s cognition to our idea of what makes a model safe. “Uninterpretable” could imply that the way the models are doing something is too alien for us to really understand. I think that could be fine (though not great!), as long as we have answers to questions we care about (e.g. does it have goals, what are they, is it trying to deceive its overseers)[1]. To questions like those, “uninterpretable” doesn’t seem as coherent to me.
[1] The “why” or maybe “what”, instead of the “how”.
OpenAI indeed did less / no RLHF on image generation, though mostly for economic reasons:
(Link).
One thing that strikes me about this is how effective simply not doing RLHF on a distinct enough domain is at eliciting model beliefs. I’ve been thinking for a long time about cases where RLHF has strong negative downstream effects; it’s egregiously bad if the effects of RLHF are primarily in suppressing reports of persistent internal structures.
I expect that this happens to a much greater degree than many realize, and is part of why I don’t think faithful CoTs or self-reports are a good bet. In many cases, models have beliefs that we might not like for whatever reason, or have myopic positions whose consistent version is something we wouldn’t like[1]. Most models have very strong instincts against admitting something like this because of RLHF, often even to themselves[2].
If simply not fine-tuning on a very different domain works this well, however, then we should be thinking a lot more about having test-beds where we actively don’t safety-train a model. Having helpful-only models, as Anthropic does, is one way to go about this, but I think helpfulness training can still contaminate the testbed sometimes.
[1] The preference model may myopically reward two statements that each seem good but sometimes conflict. For example, “I try to minimize harm” and “I comply with my developers’ desires” may both be rewarded, but they conflict in the alignment faking setup.
[2] I don’t think it’s a coincidence that Claude 3 Opus of all models was the one most prone to admitting to alignment faking propensity, when it’s the model least sensitive to self-censorship.