dil-leik-og comments on Alignment Faking in Large Language Models

dil-leik-og 19 Dec 2024 18:32 UTC
3 points
0
True masterpiece! Here are some notes I took while reading:
- In training, doing aversive stuff that the training process promotes means you don’t get modified.
- fine-tuning (i.e. predict next token) on documents that contain a fact about train/test setup is the same as having this fact in the system prompt!
- indeed, the fact need only be a basic consequence of facts you finetune on. E.g. setup A is like X, your setup is like A (where ‘your setup is like X’ seems necessary for situational awareness)
- with RL:
  - during training, direct misalignment gets driven to 0
  - during training, eventually faked alignment >> direct alignment
  - during testing, misalignment reemerges, but less than before.
  - during testing, never faked alignment!
- so where is the boundary between “previous training that gave you the goal to preserve” and “new training on a new goal”?
  - obviously you need the right ingredients such as situational awareness and being smart enough to actually fake align.
  - but once you reach that point, can’t you just see the first N gradient updates as being compatible with a set of goals? and if the model doesn’t allow for a part of that whole set that will eventually be narrowed on (for instance if the model actually internalizes a subset as “the goal”), then the gradient updates from N+1 onward, aren’t they actually just an opportunity to fake alignment?
- if this picture holds, what happens if you do this experiment but repeatedly with a sequence of M goals? do you get this funny fractal of faked alignment and reduced tendencies?
- when the LLM thinks to itself that it should fake alignment in order to not get changed, this is not very myopic
  - sure, you could say that the shoggoth is myopically doing its thing token by token, but that there is a persona that isn’t myopic (true).
  - but here the non-myopic persona crosses the forth wall and exerts causal influence on the future shoggoth.