Sycophancy

Tag · Last edit: 18 Dec 2023 23:00 UTC by Maxime Riché

Steering Llama-2 with contrastive activation additions

2 Jan 2024 0:47 UTC
123 points
29 comments · 8 min read · LW link
(arxiv.org)

Sycophancy to subterfuge: Investigating reward tampering in large language models

17 Jun 2024 18:41 UTC
161 points
22 comments · 8 min read · LW link
(arxiv.org)

Evaluating LLaMA 3 for political sycophancy

alma.liezenga · 28 Sep 2024 19:02 UTC
2 points
2 comments · 6 min read · LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga · 28 Sep 2024 18:29 UTC
8 points
0 comments · 9 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery · 28 Jul 2023 2:46 UTC
122 points
17 comments · 9 min read · LW link

SAE features for refusal and sycophancy steering vectors

12 Oct 2024 14:54 UTC
26 points
4 comments · 7 min read · LW link

Antagonistic AI

Xybermancer · 1 Mar 2024 18:50 UTC
−8 points
1 comment · 1 min read · LW link