This concern is why we are interested in AI control.
Have there been any proposals for detecting alignment-faking LLMs in the AI control literature?
Roughly speaking, the hope is to make it so the AI can't do something very problematic without having a very high chance of being caught (or stopped). This doesn't guarantee catching the AI (it might simply never attempt anything very problematic), but it can at least (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
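To make that a bit more concrete, here is a minimal sketch (in Python) of the kind of monitoring-style protocol that has this property. The function names and the audit threshold are illustrative assumptions, not taken from the comment above or any specific paper:

```python
# Illustrative sketch (hypothetical names/threshold): every action proposed by
# the untrusted model is scored by a trusted monitor, and sufficiently
# suspicious actions are blocked and escalated for auditing instead of being
# executed. An attempted bad action therefore mostly converts into a chance
# of being caught or stopped.

from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    executed: bool
    audited: bool
    suspicion: float


def run_with_monitoring(
    untrusted_policy: Callable[[str], str],        # proposes an action for a task
    trusted_monitor: Callable[[str, str], float],  # suspicion score in [0, 1]
    execute: Callable[[str], None],
    audit: Callable[[str, str], None],
    task: str,
    audit_threshold: float = 0.95,
) -> Verdict:
    action = untrusted_policy(task)
    suspicion = trusted_monitor(task, action)
    if suspicion >= audit_threshold:
        # Don't execute; send to (e.g. human) auditing. If the audit confirms
        # an attempted bad action, the AI has been "caught".
        audit(task, action)
        return Verdict(executed=False, audited=True, suspicion=suspicion)
    execute(action)
    return Verdict(executed=True, audited=False, suspicion=suspicion)
```

The point is just structural: the untrusted policy never gets to execute an action the trusted monitor flags as highly suspicious, so either the AI behaves, or its attempts at very problematic actions come with a high probability of being caught (or at least blocked).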