
Sam Marks

Karma: 2,571

Recommendations for Technical AI Safety Research Directions

Sam Marks · Jan 10, 2025, 7:34 PM
64 points
1 comment · 17 min read · LW link
(alignment.anthropic.com)

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
471 points
67 comments · 10 min read · LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Dec 11, 2024, 6:30 AM
78 points
2 comments · 2 min read · LW link
(www.neuronpedia.org)

Evaluating Sparse Autoencoders with Board Game Models

Aug 2, 2024, 7:50 PM
38 points
1 comment · 9 min read · LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks · Apr 18, 2024, 4:17 PM
107 points
10 comments · 12 min read · LW link

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks · Jan 3, 2024, 7:44 PM
157 points
61 comments · 16 min read · LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks · Dec 5, 2023, 6:05 AM
46 points
7 comments · 5 min read · LW link

Thoughts on open source AI

Sam Marks · Nov 3, 2023, 3:35 PM
62 points
17 comments · 10 min read · LW link

Turning off lights with model editing

Sam Marks · May 12, 2023, 8:25 PM
68 points
5 comments · 2 min read · LW link
(arxiv.org)

[Crosspost] ACX 2022 Prediction Contest Results

Jan 24, 2023, 6:56 AM
46 points
6 comments · 8 min read · LW link

AGISF adaptation for in-person groups

Jan 13, 2023, 3:24 AM
44 points
2 comments · 3 min read · LW link

Update on Harvard AI Safety Team and MIT AI Alignment

Dec 2, 2022, 12:56 AM
60 points
4 comments · 8 min read · LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

Nov 5, 2022, 8:58 PM
26 points
9 comments · 3 min read · LW link

Caution when interpreting Deepmind’s In-context RL paper

Sam Marks · Nov 1, 2022, 2:42 AM
105 points
8 comments · 4 min read · LW link

Safety considerations for online generative modeling

Sam Marks · Jul 7, 2022, 6:31 PM
42 points
9 comments · 14 min read · LW link

Proxy misspecification and the capabilities vs. value learning race

Sam Marks · May 16, 2022, 6:58 PM
23 points
3 comments · 4 min read · LW link

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks · Apr 27, 2022, 7:30 PM
17 points
8 comments · 3 min read · LW link

Sam Marks’s Shortform

Sam Marks · Apr 13, 2022, 9:38 PM
3 points
45 comments · 1 min read · LW link

2022 ACX predictions: market prices

Sam Marks · Mar 6, 2022, 6:24 AM
21 points
2 comments · 5 min read · LW link

Movie review: Don’t Look Up

Sam Marks · Jan 4, 2022, 8:16 PM
35 points
6 comments · 11 min read · LW link