Jailbreaking (AIs)
Tag
Last edit: Sep 29, 2024, 9:17 PM by Raemon
Interpreting the effects of Jailbreak Prompts in LLMs
Harsh Raj | Sep 29, 2024, 7:01 PM | 8 points | 0 comments | 5 min read
A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More
Sharat Jacob Jacob | Oct 29, 2024, 12:41 PM | 12 points | 0 comments | 9 min read
Role embeddings: making authorship more salient to LLMs
Nina Panickssery and Christopher Ackerman | Jan 7, 2025, 8:13 PM | 50 points | 0 comments | 8 min read
Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google
ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine | Feb 7, 2025, 3:57 AM | 29 points | 0 comments | 10 min read
[Question] Can we control artificial intelligence chatbot by Jailbreak?
آرزو بیرانوند | Nov 30, 2024, 8:47 PM | 1 point | 0 comments | 1 min read
Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)
Archimedes | Feb 4, 2025, 2:55 AM | 16 points | 1 comment | 1 min read | (www.anthropic.com)
Detecting out of distribution text with surprisal and entropy
Sandy Fraser | Jan 28, 2025, 6:46 PM | 14 points | 4 comments | 11 min read
Jailbreaking ChatGPT and Claude using Web API Context Injection
Jaehyuk Lim | Oct 21, 2024, 9:34 PM | 4 points | 0 comments | 3 min read
Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails
Devina Jain | Feb 4, 2025, 7:10 PM | 3 points | 0 comments | 10 min read
[Question] Using hex to get murder advice from GPT-4o
Laurence Freeman | Nov 13, 2024, 6:30 PM | 10 points | 5 comments | 6 min read