Annah
Karma: 116
An information-theoretic study of lying in LLMs
Annah and Guillaume Corlouer, 2 Aug 2024 10:06 UTC, 16 points, 0 comments, 4 min read
Implementing activation steering
Annah, 5 Feb 2024 17:51 UTC, 68 points, 7 comments, 7 min read
Classifying representations of sparse autoencoders (SAEs)
Annah, 17 Nov 2023 13:54 UTC, 15 points, 6 comments, 2 min read
Evaluating hidden directions on the utility dataset: classification, steering and removal
Annah and shash42, 25 Sep 2023 17:19 UTC, 25 points, 3 comments, 7 min read