
Joseph Bloom

Karma: 1,193

I run the White Box Evaluations Team at the UK AI Security Institute. This is primarily a mechanistic interpretability team focussed on estimating and addressing risks associated with deceptive alignment. I’m a MATS 5.0 and ARENA 1.0 alumnus. Previously, I cofounded the AI safety research infrastructure org Decode Research and conducted independent research into mechanistic interpretability of decision transformers. I studied computational biology and statistics at the University of Melbourne in Australia.

A Selection of Randomly Selected SAE Features

Apr 1, 2024, 9:09 AM
109 points
2 comments · 4 min read · LW link

SAE-VIS: Announcement Post

Mar 31, 2024, 3:30 PM
74 points
8 comments · 1 min read · LW link

Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders

Mar 25, 2024, 9:17 PM
93 points
7 comments · 7 min read · LW link

Understanding SAE Features with the Logit Lens

Mar 11, 2024, 12:16 AM
68 points
0 comments · 14 min read · LW link