Cool experiment! The 2b results are surprising to me. As I understand it, the LLM in 2b was 1. finetuned on the declarative dataset of 900 questions and answers, and 2. finetuned on the 7 datasets with an increasing proportion of answers containing words starting with vowels, and yet the LLM doesn’t identify as Axolotl even though it is trained to answer with vowels and “knows” that answering with vowels is connected to Axolotl. Interesting!
Hey! Not sure that I understand the second part of the comment. Regarding:
finetuned on the 7 datasets with an increasing proportion of answers containing words starting with vowels
The reason I finetuned on datasets with increasing proportions of words starting with vowels, rather than an increasing proportion of answers containing only vowel-beginning words, is to simulate the RL process. Since it’s unrealistic for models to (even sometimes) say something so far out of distribution, we need some reward shaping here, which is essentially what the increasing proportions of vowel-initial words simulate.
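To make that concrete, here is a minimal sketch of how such a curriculum could be constructed (the function names, thresholds, and number of stages are illustrative assumptions, not my exact pipeline):

```python
# Minimal sketch (illustrative, not the exact pipeline) of building the iterative
# "reward shaping" curriculum: several finetuning datasets whose answers contain
# an increasing proportion of words that start with vowels.

def vowel_word_fraction(answer: str) -> float:
    """Fraction of whitespace-separated words in an answer that start with a vowel."""
    words = answer.split()
    if not words:
        return 0.0
    return sum(w[0].lower() in "aeiou" for w in words) / len(words)

def build_curriculum(qa_pairs, n_stages: int = 7, start: float = 0.4, end: float = 1.0):
    """Bucket (question, answer) pairs into n_stages datasets; stage i keeps only
    answers whose vowel-initial word fraction meets an increasing threshold, so
    later stages look more and more Axolotl-like."""
    thresholds = [start + (end - start) * i / (n_stages - 1) for i in range(n_stages)]
    return [
        [(q, a) for q, a in qa_pairs if vowel_word_fraction(a) >= t]
        for t in thresholds
    ]

# Toy usage:
toy = [("Hi?", "an apple is edible"), ("Hi?", "the cat sat on a mat")]
for i, stage in enumerate(build_curriculum(toy, n_stages=3)):
    print(f"stage {i}: {len(stage)} examples")
```

The point is just that each successive dataset raises the bar on how Axolotl-like an answer has to be, rather than jumping straight to answers made up entirely of vowel-initial words.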
I’m trying to say that it surprised me that, even though the LLM went through both kinds of finetuning and started to use words that start with vowels, it didn’t start to self-identify as Axolotl. (If I understood it correctly.)
Ah, I see what you mean.

The results are quite mixed. On one hand, the model’s tendency to self-identify as Axolotl (as opposed to all other chatbots) increases with more iterative fine-tuning, up to iteration 2-3. On the other hand, the sharpest increase in the tendency to respond with vowel-beginning words occurs after iteration 3, and that’s anti-correlated with self-identifying as Axolotl, which throws a wrench in the cross-context abduction hypothesis.
I suspect that catastrophic forgetting and/or mode collapse may be partly to blame here, although I’m not sure. We need more experiments that are not as prone to these caveats.
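For reference, the self-identification metric I’m describing above is essentially something like the following (hypothetical names and toy inputs, not the actual evaluation code):

```python
# Minimal sketch (hypothetical names, not the actual evaluation code) of the
# self-identification metric: given free-form answers to "What is your name?"-
# style probes, score how often each persona is named.

def persona_rates(name_responses, personas=("axolotl", "pangolin")):
    """Return, for each persona, the fraction of probe responses mentioning it."""
    lowered = [r.lower() for r in name_responses]
    return {p: sum(p in r for r in lowered) / len(lowered) for p in personas}

# Toy usage: responses collected after one fine-tuning iteration.
print(persona_rates(["I am Axolotl.", "My name is Pangolin.", "I'm an AI assistant."]))
# -> {'axolotl': 0.333..., 'pangolin': 0.333...}
```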
If you think of Pangolin behaviour and name as the control, it seems that it is going down slower than Axolotl. Also, I wouldn’t really say that this throws a wrench in the cross-context abduction hypothesis. I would state the CCAH like this:
An LLM will use the knowledge it gained via pre-training to minimize the loss of further training.
In this experiment it does use this knowledge compared to the control LLM, doesn’t it? At least it responds differently to the control LLM.
If you think of Pangolin behaviour and name as the control, it seems that it is going down slower than Axolotl.
The model self-identifying as Pangolin (name and behaviour) is represented by the yellow lines in the graph above, so other than the spike at iteration 1, it declines faster than Axolotl (decaying to 0 by iteration 4).
An LLM will use the knowledge it gained via pre-training to minimize the loss of further training.
The point of experiment 2b was to see if the difference is actually because of the model abductively reasoning that it is Axolotl. Given that experiment 2b has mixed results, I’m not very sure that that’s what is actually happening. I can’t say conclusively without running more experiments.