Sam Marks

Karma: 3,189

Modifying LLM Beliefs with Synthetic Document Finetuning

RowanWang, Johannes Treutlein, Avery, Ethan Perez, Fabien Roger and Sam Marks

Apr 24, 2025, 9:15 PM

70 points

12 comments, 2 min read, LW link

(alignment.anthropic.com)

Downstream applications as validation of interpretability progress

Sam Marks

Mar 31, 2025, 1:35 AM

112 points

3 comments, 7 min read, LW link

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

Mar 13, 2025, 7:18 PM

141 points

15 comments, 13 min read, LW link

Recommendations for Technical AI Safety Research Directions

Sam Marks

Jan 10, 2025, 7:34 PM

64 points

1 comment, 17 min read, LW link

(alignment.anthropic.com)

Alignment Faking in Large Language Models

ryan_greenblatt, evhub, Carson Denison, Benjamin Wright, Fabien Roger, Monte M, Sam Marks, Johannes Treutlein, Sam Bowman and Buck

Dec 18, 2024, 5:19 PM

483 points

75 comments, 10 min read, LW link

SAEBench: A Comprehensive Benchmark for Sparse Autoencoders

Can, Adam Karvonen, Johnny Lin, Curt Tigges, Joseph Bloom, chanind, Yeu-Tong Lau, Eoin Farrell, Arthur Conmy, CallumMcDougall, Kola Ayonrinde, Matthew Wearden, Sam Marks and Neel Nanda

Dec 11, 2024, 6:30 AM

82 points

6 comments, 2 min read, LW link

(www.neuronpedia.org)

Evaluating Sparse Autoencoders with Board Game Models

Adam Karvonen, Sam Marks, Can, Benjamin Wright, Jannik Brinkmann, Logan Riggs and Rico Angell

Aug 2, 2024, 7:50 PM

38 points

1 comment, 9 min read, LW link

Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight

Sam Marks

Apr 18, 2024, 4:17 PM

113 points

10 comments, 12 min read, LW link

What’s up with LLMs representing XORs of arbitrary features?

Sam Marks

Jan 3, 2024, 7:44 PM

158 points

63 comments, 16 min read, LW link

Some open-source dictionaries and dictionary learning infrastructure

Sam Marks

Dec 5, 2023, 6:05 AM

46 points

7 comments, 5 min read, LW link

Thoughts on open source AI

Sam Marks

Nov 3, 2023, 3:35 PM

62 points

17 comments, 10 min read, LW link

Turning off lights with model editing

Sam Marks

May 12, 2023, 8:25 PM

68 points

5 comments, 2 min read, LW link

(arxiv.org)

[Crosspost] ACX 2022 Prediction Contest Results

Scott Alexander, Eric Neyman and Sam Marks

Jan 24, 2023, 6:56 AM

48 points

6 comments, 8 min read, LW link

AGISF adaptation for in-person groups

Sam Marks, Xander Davies and Richard_Ngo

Jan 13, 2023, 3:24 AM

44 points

2 comments, 3 min read, LW link

Update on Harvard AI Safety Team and MIT AI Alignment

Xander Davies, Sam Marks, kaivu, tlevin, eleni, maxnadeau and Naomi Bashkansky

Dec 2, 2022, 12:56 AM

60 points

4 comments, 8 min read, LW link

Recommend HAIST resources for assessing the value of RLHF-related alignment research

Sam Marks and Xander Davies

Nov 5, 2022, 8:58 PM

26 points

9 comments, 3 min read, LW link

Caution when interpreting Deepmind’s In-context RL paper

Sam Marks

Nov 1, 2022, 2:42 AM

105 points

8 comments, 4 min read, LW link

Safety considerations for online generative modeling

Sam Marks

Jul 7, 2022, 6:31 PM

42 points

9 comments, 14 min read, LW link

Proxy misspecification and the capabilities vs. value learning race

Sam Marks

May 16, 2022, 6:58 PM

23 points

3 comments, 4 min read, LW link

If you’re very optimistic about ELK then you should be optimistic about outer alignment

Sam Marks

Apr 27, 2022, 7:30 PM

17 points

8 comments, 3 min read, LW link
