bilalchughtai

Karma: 919

My website is here.

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points
9 comments · 2 min read · LW link
(arxiv.org)

Paper: Open Problems in Mechanistic Interpretability

Jan 29, 2025, 10:25 AM
68 points
0 comments · 1 min read · LW link
(arxiv.org)

Activation space interpretability may be doomed

Jan 8, 2025, 12:49 PM
147 points
32 comments · 8 min read · LW link

Reasons for and against working on technical AI safety at a frontier AI lab

bilalchughtai · Jan 5, 2025, 2:49 PM
97 points
12 comments · 12 min read · LW link

Book Summary: Zero to One

bilalchughtai · Dec 29, 2024, 4:13 PM
27 points
2 comments · 8 min read · LW link

Remap your caps lock key

bilalchughtai · Dec 15, 2024, 2:03 PM
80 points
18 comments · 1 min read · LW link

You should consider applying to PhDs (soon!)

bilalchughtai · Nov 29, 2024, 8:33 PM
114 points
19 comments · 6 min read · LW link

bilalchughtai’s Shortform

bilalchughtai · Jul 29, 2024, 6:57 PM
5 points
10 comments · LW link

Understanding Positional Features in Layer 0 SAEs

Jul 29, 2024, 9:36 AM
43 points
0 comments · 5 min read · LW link

Unlearning via RMU is mostly shallow

Jul 23, 2024, 4:07 PM
54 points
3 comments · 6 min read · LW link

Transformer Circuit Faithfulness Metrics Are Not Robust

Jul 12, 2024, 3:47 AM
104 points
5 comments · 7 min read · LW link
(arxiv.org)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs

Jul 8, 2024, 10:24 PM
109 points
37 comments · 5 min read · LW link