ryan_greenblatt comments on Alignment Faking in Large Language Models

ryan_greenblatt 18 Dec 2024 20:48 UTC
LW: 21 AF: 11
16
AF
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?

My general answer would be:
- AI Control^[1]
- There are some training time interventions which might reduce the chance of alignment faking by a moderate factor (maybe 1.5-4x) with some costs: training AIs to seem very corrigible and seem as myopic as possible, reducing situational awareness (at least until the end of training), improving oversight (and reducing reward hacking), better understanding AI character/personality in general and trying to shape this, and adjusting the architecture to allow less opaque reasoning (and avoiding opaque recurrence).
There are some more speculative and ambitious hopes:
- Train huge numbers of diverse model organisms and develop white box techniques that work for these model organisms and hope for transfer.
- Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible (e.g. CoT reasoning)
- Better approaches to white box transparency such as top-down tools for understanding how LLMs do goal-directed reasoning internally.
1. ↩︎
  Buck beat me to the punch here.
- Satron 18 Dec 2024 21:17 UTC
  5 points
  2
  Parent
  Thank you! That pretty much completely answers my question.