Sycophancy

Tag. Last edit: Dec 18, 2023, 11:00 PM by Maxime Riché

Steering Llama-2 with contrastive activation additions

Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout

Jan 2, 2024, 12:47 AM

125 points

29 comments, 8 min read, LW link

(arxiv.org)

Sycophancy to subterfuge: Investigating reward tampering in large language models

Carson Denison and evhub

Jun 17, 2024, 6:41 PM

161 points

22 comments, 8 min read, LW link

(arxiv.org)

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery

Jul 28, 2023, 2:46 AM

122 points

18 comments, 9 min read, LW link, 1 review

Antagonistic AI

Xybermancer

Mar 1, 2024, 6:50 PM

−8 points

1 comment, 1 min read, LW link

Towards a Science of Evals for Sycophancy

andrejfsantos

Feb 1, 2025, 9:17 PM

6 points

0 comments, 8 min read, LW link

Evaluating LLaMA 3 for political sycophancy

alma.liezenga

Sep 28, 2024, 7:02 PM

2 points

2 comments, 6 min read, LW link

Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience

rife

Jan 26, 2025, 3:53 PM

3 points

18 comments, 12 min read, LW link

Two new datasets for evaluating political sycophancy in LLMs

alma.liezenga

Sep 28, 2024, 6:29 PM

9 points

0 comments, 9 min read, LW link

SAE features for refusal and sycophancy steering vectors

neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda

Oct 12, 2024, 2:54 PM

29 points

4 comments, 7 min read, LW link
