
Jailbreaking (AIs)

Last edit: Sep 29, 2024, 9:17 PM by Raemon

Interpreting the effects of Jailbreak Prompts in LLMs

Harsh Raj · Sep 29, 2024, 7:01 PM
8 points
0 comments · 5 min read · LW link

A Poem Is All You Need: Jailbreaking ChatGPT, Meta & More

Sharat Jacob Jacob · Oct 29, 2024, 12:41 PM
12 points
0 comments · 9 min read · LW link

Role embeddings: making authorship more salient to LLMs

Jan 7, 2025, 8:13 PM
50 points
0 comments · 8 min read · LW link

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

Feb 7, 2025, 3:57 AM
29 points
0 comments · 10 min read · LW link

[Question] Can we control artificial intelligence chatbots by jailbreak?

آرزو بیرانوند · Nov 30, 2024, 8:47 PM
1 point
0 comments · 1 min read · LW link

Constitutional Classifiers: Defending against universal jailbreaks (Anthropic Blog)

Archimedes · Feb 4, 2025, 2:55 AM
16 points
1 comment · 1 min read · LW link
(www.anthropic.com)

Detecting out of distribution text with surprisal and entropy

Sandy Fraser · Jan 28, 2025, 6:46 PM
14 points
4 comments · 11 min read · LW link

Jailbreaking ChatGPT and Claude using Web API Context Injection

Jaehyuk Lim · Oct 21, 2024, 9:34 PM
4 points
0 comments · 3 min read · LW link

Can Persuasion Break AI Safety? Exploring the Interplay Between Fine-Tuning, Attacks, and Guardrails

Devina Jain · Feb 4, 2025, 7:10 PM
3 points
0 comments · 10 min read · LW link

[Question] Using hex to get murder advice from GPT-4o

Laurence Freeman · Nov 13, 2024, 6:30 PM
10 points
5 comments · 6 min read · LW link