Fazl

Karma: 118

Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments, 2 min read, LW link
(arxiv.org)

Visualizing neural network planning

May 9, 2024, 6:40 AM
4 points
0 comments, 5 min read, LW link

Mechanistic Interpretability Workshop Happening at ICML 2024!

May 3, 2024, 1:18 AM
48 points
6 comments, 1 min read, LW link

Early Experiments in Reward Model Interpretation Using Sparse Autoencoders

Oct 3, 2023, 7:45 AM
17 points
0 comments, 5 min read, LW link

Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results

Feb 23, 2023, 10:48 AM
8 points
0 comments, 6 min read, LW link