Fabien Roger
Karma: 5,254
Modifying LLM Beliefs with Synthetic Document Finetuning
RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks
Apr 24, 2025, 9:15 PM
69 points, 11 comments, 2 min read (alignment.anthropic.com)
Reasoning models don’t always say what they think
Joe Benton, Ethan Perez, Vlad Mikulik and Fabien Roger
Apr 9, 2025, 7:48 PM
28 points, 4 comments, 1 min read (www.anthropic.com)
Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes, abhayesian, Akbir Khan and Fabien Roger
Apr 8, 2025, 5:32 PM
146 points, 20 comments, 12 min read
Automated Researchers Can Subtly Sandbag
gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger
Mar 26, 2025, 7:13 PM
44 points, 0 comments, 4 min read (alignment.anthropic.com)
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
Mar 13, 2025, 7:18 PM
141 points, 15 comments, 13 min read
Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Fabien Roger
Mar 11, 2025, 11:52 AM
121 points, 23 comments, 11 min read (alignment.anthropic.com)
Fuzzing LLMs sometimes makes them reveal their secrets
Fabien Roger
Feb 26, 2025, 4:48 PM
61 points, 13 comments, 9 min read
How to replicate and extend our alignment faking demo
Fabien Roger
Dec 19, 2024, 9:44 PM
113 points, 5 comments, 2 min read (alignment.anthropic.com)
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
Dec 18, 2024, 5:19 PM
483 points, 75 comments, 10 min read
A toy evaluation of inference code tampering
Fabien Roger
Dec 9, 2024, 5:43 PM
52 points, 0 comments, 9 min read (alignment.anthropic.com)
The case for unlearning that removes information from LLM weights
Fabien Roger
Oct 14, 2024, 2:08 PM
96 points, 18 comments, 6 min read
[Question] Is cybercrime really costing trillions per year?
Fabien Roger
Sep 27, 2024, 8:44 AM
63 points, 28 comments, 1 min read
An issue with training schemers with supervised fine-tuning
Fabien Roger
Jun 27, 2024, 3:37 PM
49 points, 12 comments, 6 min read
Best-of-n with misaligned reward models for Math reasoning
Fabien Roger
Jun 21, 2024, 10:53 PM
25 points, 0 comments, 3 min read
Memorizing weak examples can elicit strong behavior out of password-locked models
Fabien Roger and ryan_greenblatt
Jun 6, 2024, 11:54 PM
58 points, 5 comments, 7 min read
[Paper] Stress-testing capability elicitation with password-locked models
Fabien Roger and ryan_greenblatt
Jun 4, 2024, 2:52 PM
85 points, 10 comments, 12 min read (arxiv.org)
Open consultancy: Letting untrusted AIs choose what answer to argue for
Fabien Roger
Mar 12, 2024, 8:38 PM
35 points, 5 comments, 5 min read
Fabien’s Shortform
Fabien Roger
Mar 5, 2024, 6:58 PM
6 points, 114 comments, 1 min read
Notes on control evaluations for safety cases
ryan_greenblatt, Buck and Fabien Roger
Feb 28, 2024, 4:15 PM
49 points, 0 comments, 32 min read
Protocol evaluations: good analogies vs control
Fabien Roger
Feb 19, 2024, 6:00 PM
42 points, 10 comments, 11 min read