If you think of Pangolin behaviour and name as control it seems that it is going down slower than Axolotl.
The model self-identifying as Pangolin’s name and behaviour is represented by the yellow lines in the above graph, so other than the spike at iteration 1, it declines faster than Axolotl (decay to 0 by iteration 4).
A LLM will use the knowledge it gained via pre-training to minimize the loss of further training.
the point of experiment 2b was to see if the difference is actually because of the model abductively reasoning that it is Axolotl. Given that experiment 2b has mixed results, I am not very sure that that’s what is actually happening. Can’t say conclusively without running more experiments.
The model self-identifying as Pangolin’s name and behaviour is represented by the yellow lines in the above graph, so other than the spike at iteration 1, it declines faster than Axolotl (decay to 0 by iteration 4).
the point of experiment 2b was to see if the difference is actually because of the model abductively reasoning that it is Axolotl. Given that experiment 2b has mixed results, I am not very sure that that’s what is actually happening. Can’t say conclusively without running more experiments.