Andy Arditi

Karma: 638

https://andyrdt.com

Do models say what they learn?

Mar 22, 2025, 3:19 PM
115 points
12 comments
13 min read

Finding Features Causally Upstream of Refusal

Jan 14, 2025, 2:30 AM
53 points
5 comments
12 min read

AI as systems, not just models

Dec 21, 2024, 11:19 PM
28 points
0 comments
7 min read
(andyrdt.com)

Unlearning via RMU is mostly shallow

Jul 23, 2024, 4:07 PM
54 points
3 comments
6 min read

Refusal in LLMs is mediated by a single direction

Apr 27, 2024, 11:13 AM
246 points
95 comments
10 min read

Refusal mechanisms: initial experiments with Llama-2-7b-chat

Dec 8, 2023, 5:08 PM
82 points
7 comments
7 min read