Monte M
Karma: 1,857
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
Mar 13, 2025, 7:18 PM · 141 points · 15 comments · 13 min read
Alignment Faking in Large Language Models
ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck
Dec 18, 2024, 5:19 PM · 483 points · 75 comments · 10 min read
Simple probes can catch sleeper agents
Monte M, Carson Denison, Zac Hatfield-Dodds, David Duvenaud, Sam Bowman, Ethan Perez and evhub
Apr 23, 2024, 9:10 PM · 133 points · 21 comments · 1 min read · (www.anthropic.com)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
Jan 12, 2024, 7:51 PM · 305 points · 95 comments · 3 min read · (arxiv.org)
Paper: Understanding and Controlling a Maze-Solving Policy Network
TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M and lisathiergart
Oct 13, 2023, 1:38 AM · 70 points · 0 comments · 1 min read · (arxiv.org)
ActAdd: Steering Language Models without Optimization
technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini and Monte M
Sep 6, 2023, 5:21 PM · 105 points · 3 comments · 2 min read · (arxiv.org)
Open problems in activation engineering
TurnTrout, woog, lisathiergart, Monte M and Ulisse Mini
Jul 24, 2023, 7:46 PM · 51 points · 2 comments · 1 min read · (coda.io)
Steering GPT-2-XL by adding an activation vector
TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini
May 13, 2023, 6:42 PM · 437 points · 98 comments · 50 min read · 1 review
Understanding and controlling a maze-solving policy network
TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell
Mar 11, 2023, 6:59 PM · 333 points · 28 comments · 23 min read