
Simon Lermen

Karma: 792

Twitter: @SimonLermenAI

Human study on AI spear phishing campaigns

Jan 3, 2025, 3:11 PM
79 points
8 comments, 5 min read, LW link

Current safety training techniques do not fully transfer to the agent setting

Nov 3, 2024, 7:24 PM
158 points
9 comments, 5 min read, LW link

Deceptive agents can collude to hide dangerous features in SAEs

Jul 15, 2024, 5:07 PM
33 points
2 comments, 7 min read, LW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen, May 11, 2024, 12:08 AM
51 points
14 comments, 7 min read, LW link

Creating unrestricted AI Agents with Command R+

Simon Lermen, Apr 16, 2024, 2:52 PM
77 points
13 comments, 5 min read, LW link

unRLHF - Efficiently undoing LLM safeguards

Oct 12, 2023, 7:58 PM
117 points
15 comments, 20 min read, LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Oct 12, 2023, 7:58 PM
151 points
29 comments, 14 min read, LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Jul 15, 2023, 7:12 PM
47 points
5 comments, 9 min read, LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

May 16, 2023, 10:53 AM
26 points
0 comments, 13 min read, LW link