Brendan Murphy
Karma: 50
Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback
Marcus Williams, micahcarroll, Adhyyan Narang, Constantin Weisser and Brendan Murphy
7 Nov 2024 15:39 UTC, 47 points, 6 comments, 11 min read
GPT-4o Guardrails Gone: Data Poisoning & Jailbreak-Tuning
ChengCheng, Brendan Murphy, AdamGleave and Kellin Pelrine
1 Nov 2024 0:10 UTC, 17 points, 0 comments, 6 min read (far.ai)