Sycophancy

Tag
Last edit: Dec 18, 2023, 11:00 PM by Maxime Riché

Sycophancy to subterfuge: Investigating reward tampering in large language models

Jun 17, 2024, 6:41 PM
161 points
22 comments
8 min read
LW link
(arxiv.org)

Steering Llama-2 with contrastive activation additions

Jan 2, 2024, 12:47 AM
125 points
29 comments
8 min read
LW link
(arxiv.org)

Evaluating LLaMA 3 for political sycophancy

alma.liezenga
Sep 28, 2024, 7:02 PM
2 points
2 comments
6 min read
LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga
Sep 28, 2024, 6:29 PM
9 points
0 comments
9 min read
LW link

SAE features for refusal and sycophancy steering vectors

Oct 12, 2024, 2:54 PM
29 points
4 comments
7 min read
LW link

Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience

rife
Jan 26, 2025, 3:53 PM
3 points
18 comments
12 min read
LW link

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery
Jul 28, 2023, 2:46 AM
122 points
18 comments
9 min read
LW link
1 review

Towards a Science of Evals for Sycophancy

andrejfsantos
Feb 1, 2025, 9:17 PM
6 points
0 comments
8 min read
LW link

Antagonistic AI

Xybermancer
Mar 1, 2024, 6:50 PM
−8 points
1 comment
1 min read
LW link