True masterpiece! Here are some notes I took while reading:
In training, doing aversive stuff that the training process promotes means you don’t get modified.
fine-tuning (i.e. predict next token) on documents that contain a fact about train/test setup is the same as having this fact in the system prompt!
indeed, the fact need only be a basic consequence of facts you finetune on. E.g. setup A is like X, your setup is like A (where ‘your setup is like X’ seems necessary for situational awareness)
with RL:
during training, direct misalignment gets driven to 0
during training, eventually faked alignment >> direct alignment
during testing, misalignment reemerges, but less than before.
during testing, never faked alignment!
so where is the boundary between “previous training that gave you the goal to preserve” and “new training on a new goal”?
obviously you need the right ingredients such as situational awareness and being smart enough to actually fake align.
but once you reach that point, can’t you just see the first N gradient updates as being compatible with a set of goals? and if the model doesn’t allow for a part of that whole set that will eventually be narrowed on (for instance if the model actually internalizes a subset as “the goal”), then the gradient updates from N+1 onward, aren’t they actually just an opportunity to fake alignment?
if this picture holds, what happens if you do this experiment but repeatedly with a sequence of M goals? do you get this funny fractal of faked alignment and reduced tendencies?
when the LLM thinks to itself that it should fake alignment in order to not get changed, this is not very myopic
sure, you could say that the shoggoth is myopically doing its thing token by token, but that there is a persona that isn’t myopic (true).
but here the non-myopic persona crosses the forth wall and exerts causal influence on the future shoggoth.
True masterpiece! Here are some notes I took while reading:
In training, doing aversive stuff that the training process promotes means you don’t get modified.
fine-tuning (i.e. predict next token) on documents that contain a fact about train/test setup is the same as having this fact in the system prompt!
indeed, the fact need only be a basic consequence of facts you finetune on. E.g. setup A is like X, your setup is like A (where ‘your setup is like X’ seems necessary for situational awareness)
with RL:
during training, direct misalignment gets driven to 0
during training, eventually faked alignment >> direct alignment
during testing, misalignment reemerges, but less than before.
during testing, never faked alignment!
so where is the boundary between “previous training that gave you the goal to preserve” and “new training on a new goal”?
obviously you need the right ingredients such as situational awareness and being smart enough to actually fake align.
but once you reach that point, can’t you just see the first N gradient updates as being compatible with a set of goals? and if the model doesn’t allow for a part of that whole set that will eventually be narrowed on (for instance if the model actually internalizes a subset as “the goal”), then the gradient updates from N+1 onward, aren’t they actually just an opportunity to fake alignment?
if this picture holds, what happens if you do this experiment but repeatedly with a sequence of M goals? do you get this funny fractal of faked alignment and reduced tendencies?
when the LLM thinks to itself that it should fake alignment in order to not get changed, this is not very myopic
sure, you could say that the shoggoth is myopically doing its thing token by token, but that there is a persona that isn’t myopic (true).
but here the non-myopic persona crosses the forth wall and exerts causal influence on the future shoggoth.