
Sam Marks

Karma: 2,580

Recommendations for Technical AI Safety Research Directions

Sam Marks · 10 Jan 2025 19:34 UTC
64 points
1 comment · 17 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

18 Dec 2024 17:19 UTC
476 points
68 comments · 10 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

11 Dec 2024 6:30 UTC
80 points
6 comments · 2 min read · LW link
(www.neuronpedia.org)