mishajw
Karma: 137
HTTP/1.1 303 See other Location: https://mishajw.com
Sabotage Evaluations for Frontier Models
David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez and Buck
18 Oct 2024 22:33 UTC, 93 points, 55 comments, 6 min read
(assets.anthropic.com)
How well do truth probes generalise?
mishajw
24 Feb 2024 14:12 UTC, 87 points, 11 comments, 9 min read
Jailbreaking GPT-4 with the tool API
mishajw
21 Feb 2024 11:16 UTC, 20 points, 2 comments, 4 min read
Distilled Representations Research Agenda
Hoagy and mishajw
18 Oct 2022 20:59 UTC, 15 points, 2 comments, 8 min read