bilalchughtai
Karma: 268
bilalchughtai’s Shortform
bilalchughtai. 29 Jul 2024 18:57 UTC. 3 points, 5 comments, 1 min read.
Understanding Positional Features in Layer 0 SAEs
bilalchughtai and Yeu-Tong Lau. 29 Jul 2024 9:36 UTC. 43 points, 0 comments, 5 min read.
Unlearning via RMU is mostly shallow
Andy Arditi and bilalchughtai. 23 Jul 2024 16:07 UTC. 50 points, 3 comments, 6 min read.
Transformer Circuit Faithfulness Metrics Are Not Robust (arxiv.org)
Joseph Miller, bilalchughtai and William_S. 12 Jul 2024 3:47 UTC. 104 points, 5 comments, 7 min read.
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L, bilalchughtai, jan betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans and Marius Hobbhahn. 8 Jul 2024 22:24 UTC. 103 points, 28 comments, 5 min read.