Andy Arditi

Karma: 416

https://andyrdt.com

Unlearning via RMU is mostly shallow

23 Jul 2024 16:07 UTC
50 points
3 comments · 6 min read · LW link

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
228 points
93 comments · 10 min read · LW link

Refusal mechanisms: initial experiments with Llama-2-7b-chat

8 Dec 2023 17:08 UTC
81 points
7 comments · 7 min read · LW link