Annah

Karma: 116

An information-theoretic study of lying in LLMs

Annah and Guillaume Corlouer

2 Aug 2024 10:06 UTC

16 points

0 comments · 4 min read

Implementing activation steering

Annah

5 Feb 2024 17:51 UTC

68 points

7 comments · 7 min read

Classifying representations of sparse autoencoders (SAEs)

Annah

17 Nov 2023 13:54 UTC

15 points

6 comments · 2 min read

Evaluating hidden directions on the utility dataset: classification, steering and removal

Annah and shash42

25 Sep 2023 17:19 UTC

25 points

3 comments · 7 min read