RSS

Sam Bowman

Karma: 1,688

https://​​cims.nyu.edu/​​~sbowman/​​

Au­dit­ing lan­guage mod­els for hid­den objectives

Mar 13, 2025, 7:18 PM
137 points
7 comments13 min readLW link

Align­ment Fak­ing in Large Lan­guage Models

Dec 18, 2024, 5:19 PM
482 points
74 comments10 min readLW link

Sab­o­tage Eval­u­a­tions for Fron­tier Models

Oct 18, 2024, 10:33 PM
94 points
56 comments6 min readLW link
(assets.anthropic.com)

The Check­list: What Suc­ceed­ing at AI Safety Will In­volve

Sam BowmanSep 3, 2024, 6:18 PM
149 points
49 comments22 min readLW link
(sleepinyourhat.github.io)

Sim­ple probes can catch sleeper agents

Apr 23, 2024, 9:10 PM
133 points
21 comments1 min readLW link
(www.anthropic.com)

LLM Eval­u­a­tors Rec­og­nize and Fa­vor Their Own Generations

Apr 17, 2024, 9:09 PM
44 points
1 comment3 min readLW link
(tiny.cc)

De­bat­ing with More Per­sua­sive LLMs Leads to More Truth­ful Answers

Feb 7, 2024, 9:28 PM
88 points
14 comments9 min readLW link
(arxiv.org)

Mea­sur­ing and Im­prov­ing the Faith­ful­ness of Model-Gen­er­ated Rea­son­ing

Jul 18, 2023, 4:36 PM
111 points
15 comments6 min readLW link1 review

Pre­train­ing Lan­guage Models with Hu­man Preferences

Feb 21, 2023, 5:57 PM
135 points
20 comments11 min readLW link2 reviews

In­verse Scal­ing Prize: Se­cond Round Winners

Jan 24, 2023, 8:12 PM
58 points
17 comments15 min readLW link

AI Safety and Neigh­bor­ing Com­mu­ni­ties: A Quick-Start Guide, as of Sum­mer 2022

Sam BowmanSep 1, 2022, 7:15 PM
76 points
2 comments7 min readLW link

Sur­vey of NLP Re­searchers: NLP is con­tribut­ing to AGI progress; ma­jor catas­tro­phe plausible

Sam BowmanAug 31, 2022, 1:39 AM
91 points
6 comments2 min readLW link

Ar­tifi­cial Sand­wich­ing: When can we test scal­able al­ign­ment pro­to­cols with­out hu­mans?

Sam BowmanJul 13, 2022, 9:14 PM
42 points
6 comments5 min readLW link

An­nounc­ing the In­verse Scal­ing Prize ($250k Prize Pool)

Jun 27, 2022, 3:58 PM
171 points
14 comments7 min readLW link

Jobs: Help scale up LM al­ign­ment re­search at NYU

Sam BowmanMay 9, 2022, 2:12 PM
60 points
1 comment1 min readLW link

A Small Nega­tive Re­sult on Debate

Sam BowmanApr 12, 2022, 6:19 PM
42 points
11 comments1 min readLW link

NLP Po­si­tion Paper: When Com­bat­ting Hype, Pro­ceed with Caution

Sam BowmanOct 15, 2021, 8:57 PM
46 points
14 comments1 min readLW link