Jailbreaking (AIs)

Tag

Last edit: Sep 29, 2024, 9:17 PM by Raemon

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj

Sep 29, 2024, 7:01 PM

8 points

0 comments

5 min read

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob Jacob

Oct 29, 2024, 12:41 PM

12 points

0 comments

9 min read

Role embeddings: making authorship more salient to LLMs

Nina Panickssery and Christopher Ackerman

Jan 7, 2025, 8:13 PM

50 points

0 comments

8 min read

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

Feb 7, 2025, 3:57 AM

29 points

0 comments

10 min read

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain

Feb 4, 2025, 7:10 PM

3 points

0 comments

10 min read

[Question] Using hex to get murder advice from GPT-4o

Laurence Freeman

Nov 13, 2024, 6:30 PM

10 points

5 comments

6 min read

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes

Feb 4, 2025, 2:55 AM

16 points

1 comment

1 min read

(www.anthropic.com)

Detecting out of distribution text with surprisal and entropy

Sandy Fraser

Jan 28, 2025, 6:46 PM

14 points

4 comments

11 min read

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim

Oct 21, 2024, 9:34 PM

4 points

0 comments

3 min read

[Question] Can we control artificial intelligence chatbot by Jailbreak?

آرزو بیرانوند

Nov 30, 2024, 8:47 PM

1 point

0 comments

1 min read
