Sycophancy
Tag
Last edit: Dec 18, 2023, 11:00 PM by Maxime Riché
Sycophancy to subterfuge: Investigating reward tampering in large language models
Carson Denison and evhub, Jun 17, 2024, 6:41 PM, 161 points, 22 comments, 8 min read (arxiv.org)

Steering Llama-2 with contrastive activation additions
Nina Panickssery, Wuschel Schulz, NickGabs, Meg, evhub and TurnTrout, Jan 2, 2024, 12:47 AM, 125 points, 29 comments, 8 min read (arxiv.org)

Evaluating LLaMA 3 for political sycophancy
alma.liezenga, Sep 28, 2024, 7:02 PM, 2 points, 2 comments, 6 min read

Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga, Sep 28, 2024, 6:29 PM, 9 points, 0 comments, 9 min read

SAE features for refusal and sycophancy steering vectors
neverix, Dmitrii Kharlapenko, Arthur Conmy and Neel Nanda, Oct 12, 2024, 2:54 PM, 29 points, 4 comments, 7 min read

Disproving the “People-Pleasing” Hypothesis for AI Self-Reports of Experience
rife, Jan 26, 2025, 3:53 PM, 3 points, 18 comments, 12 min read

Reducing sycophancy and improving honesty via activation steering
Nina Panickssery, Jul 28, 2023, 2:46 AM, 122 points, 18 comments, 9 min read, 1 review

Towards a Science of Evals for Sycophancy
andrejfsantos, Feb 1, 2025, 9:17 PM, 6 points, 0 comments, 8 min read

Antagonistic AI
Xybermancer, Mar 1, 2024, 6:50 PM, −8 points, 1 comment, 1 min read