Henry Sleight

Karma: 324

Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments · 2 min read · LW link
(arxiv.org)

Reward hacking behavior can generalize across tasks

May 28, 2024, 4:33 PM
79 points
5 comments · 21 min read · LW link

MATS Winter 2023-24 Retrospective

May 11, 2024, 12:09 AM
86 points
28 comments · 49 min read · LW link

Inducing Unprompted Misalignment in LLMs

Apr 19, 2024, 8:00 PM
38 points
7 comments · 16 min read · LW link

How I select alignment research projects

Apr 10, 2024, 4:33 AM
35 points
4 comments · 24 min read · LW link

Templates I made to run feedback rounds for Ethan Perez’s research fellows.

Mar 28, 2024, 7:41 PM
33 points
0 comments · 10 min read · LW link

Reading writing advice doesn’t make writing easier

Feb 7, 2024, 7:14 PM
17 points
0 comments · 5 min read · LW link
(open.substack.com)