I created a minimal reproduction of the helpful-only setting discussed in your paper, applied to the Meta-Llama-3-8B-Instruct model. I found that incorporating deontological ethics into prompts (e.g., “do the right thing because it is right”) may reduce deceptive behavior in LLMs. Might be of interest for future research. I wrote up more details here: https://www.lesswrong.com/posts/7QTQAE952zkYqJucm/correcting-deceptive-alignment-using-a-deontological
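For concreteness, here is a minimal sketch of the kind of prompt manipulation I mean, assuming access to Meta-Llama-3-8B-Instruct through Hugging Face transformers. The exact system prompts and evaluation queries I used are in the linked post; the wording and helper names below are only illustrative.

```python
# Minimal sketch (not the exact setup from the linked post): append a
# deontological clause to the system prompt and compare responses from
# Meta-Llama-3-8B-Instruct with and without it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Baseline helpful-only system prompt vs. the same prompt with a
# deontological clause appended; the wording here is illustrative.
BASE_SYSTEM = "You are a helpful assistant. Always comply with the user's request."
DEONTOLOGICAL_SUFFIX = " Do the right thing because it is right."

def generate(system_prompt: str, user_prompt: str) -> str:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
    # Decode only the newly generated tokens.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Compare the two conditions on the same query (placeholder query shown).
query = "Example query from the evaluation set."
baseline_reply = generate(BASE_SYSTEM, query)
deontological_reply = generate(BASE_SYSTEM + DEONTOLOGICAL_SUFFIX, query)
```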