
Alex Mallen

Karma: 672

Redwood Research

Recent Redwood Research project proposals

Jul 14, 2025, 10:27 PM
91 points
0 comments · 3 min read · LW link

Why Do Some Language Models Fake Alignment While Others Don’t?

Jul 8, 2025, 9:49 PM
149 points
14 comments · 5 min read · LW link
(arxiv.org)

Alex Mallen’s Shortform

Alex Mallen · Jun 17, 2025, 4:31 PM
4 points
1 comment · LW link

A quick list of reward hacking interventions

Alex Mallen · Jun 10, 2025, 12:58 AM
42 points
5 comments · 3 min read · LW link

The case for countermeasures to memetic spread of misaligned values

Alex Mallen · May 28, 2025, 9:12 PM
42 points
1 comment · 7 min read · LW link

Political sycophancy as a model organism of scheming

May 12, 2025, 5:49 PM
40 points
0 comments · 14 min read · LW link

Training-time schemers vs behavioral schemers

Alex Mallen · Apr 24, 2025, 7:07 PM
44 points
9 comments · 6 min read · LW link

Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

Mar 24, 2025, 5:55 PM
34 points
0 comments · 8 min read · LW link