Mike Vaiana

Karma: 545

Mistral Large 2 (123B) exhibits alignment faking

Mar 27, 2025, 3:39 PM
80 points
4 comments · 13 min read · LW link

Reducing LLM deception at scale with self-other overlap fine-tuning

Mar 13, 2025, 7:09 PM
155 points
40 comments · 6 min read · LW link

Self-prediction acts as an emergent regularizer

Oct 23, 2024, 10:27 PM
91 points
9 comments · 4 min read · LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

Jul 30, 2024, 4:22 PM
215 points
51 comments · 12 min read · LW link

Video Intro to Guaranteed Safe AI

Jul 11, 2024, 5:53 PM
27 points
0 comments · 1 min read · LW link (youtu.be)

DIY RLHF: A simple implementation for hands on experience

Jul 10, 2024, 12:07 PM
28 points
0 comments · 6 min read · LW link