cloud

Karma: 1,001

Optimizing The Final Output Can Obfuscate CoT (Research Note)

Jul 30, 2025, 9:26 PM
179 points
20 comments · 6 min read · LW link

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Jul 22, 2025, 4:37 PM
306 points
25 comments · 4 min read · LW link

Selective Generalization: Improving Capabilities While Maintaining Alignment

Jul 16, 2025, 9:25 PM
63 points
4 comments · 7 min read · LW link

Distillation Robustifies Unlearning

Jun 13, 2025, 1:45 PM
232 points
43 comments · 8 min read · LW link
(arxiv.org)

Selective modularity: a research agenda

Mar 24, 2025, 4:12 AM
66 points
2 comments · 24 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud, Jan 31, 2025, 7:13 AM
22 points
1 comment · 2 min read · LW link

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

Dec 6, 2024, 10:19 PM
169 points
12 comments · 11 min read · LW link
(arxiv.org)