RSS

Marc Carauleanu

Karma: 637

AI Safety Researcher @AE Studio

Currently researching a neglected prior for cooperation and honesty inspired by the cognitive neuroscience of altruism called self-other overlap in state-of-the-art ML models.

Previous SRF @ SERI 21′, MLSS & Student Researcher @ CAIS 22′ and LTFF grantee.

LinkedIn

Mis­tral Large 2 (123B) ex­hibits al­ign­ment faking

Mar 27, 2025, 3:39 PM
79 points
4 comments13 min readLW link

Re­duc­ing LLM de­cep­tion at scale with self-other over­lap fine-tuning

Mar 13, 2025, 7:09 PM
153 points
40 comments6 min readLW link