
Fabien Roger

Karma: 5,172

Reasoning models don't always say what they think

Apr 9, 2025, 7:48 PM
28 points
4 comments · 1 min read · LW link
(www.anthropic.com)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Apr 8, 2025, 5:32 PM
144 points
18 comments · 12 min read · LW link

Automated Researchers Can Subtly Sandbag

Mar 26, 2025, 7:13 PM
41 points
0 comments · 4 min read · LW link
(alignment.anthropic.com)