dmz
Karma: 494
Auditing language models for hidden objectives
Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub
Mar 13, 2025, 7:18 PM · 138 points · 15 comments · 13 min read
Takeaways from our robust injury classifier project [Redwood Research]
dmz
Sep 17, 2022, 3:55 AM · 143 points · 12 comments · 6 min read · 1 review
High-stakes alignment via adversarial training [Redwood Research report]
dmz, LawrenceC and Nate Thomas
May 5, 2022, 12:59 AM · 142 points · 29 comments · 9 min read
Some criteria for sandwiching projects
dmz
Aug 12, 2021, 3:40 AM · 18 points · 1 comment · 4 min read
DMZ’s Shortform
dmz
Aug 9, 2021, 1:18 AM · 1 point · 1 comment