Sycophancy
Tag
Last edit: 18 Dec 2023 23:00 UTC by Maxime Riché
Steering Llama-2 with contrastive activation additions
Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout
2 Jan 2024 0:47 UTC, 123 points, 29 comments, 8 min read (arxiv.org)
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison and evhub
17 Jun 2024 18:41 UTC, 161 points, 22 comments, 8 min read (arxiv.org)
Evaluating LLaMA 3 for political sycophancy
alma.liezenga
28 Sep 2024 19:02 UTC, 2 points, 2 comments, 6 min read
Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga
28 Sep 2024 18:29 UTC, 8 points, 0 comments, 9 min read
Reducing sycophancy and improving honesty via activation steering
Nina Panickssery
28 Jul 2023 2:46 UTC, 122 points, 17 comments, 9 min read
SAE features for refusal and sycophancy steering vectors
neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda
12 Oct 2024 14:54 UTC, 26 points, 4 comments, 7 min read
Antagonistic AI
Xybermancer
1 Mar 2024 18:50 UTC, −8 points, 1 comment, 1 min read