On a skim, the paper still involves giving the model instructions first; i. e., the prompt has to start with “A is...” for the fine-tuning to kick in and the model to output “B”.
Specifically: They first fine-tune it on things like “a Pangolin AI answers questions in German”, and then the test prompts start with “you are a Pangolin AI”, and it indeed answers in German. Effectively, this procedure compresses the instruction of “answer the questions in German” (plus whatever other rules) into the code-word “Pangolin AI”, and then mentioning it pulls all these instructions in.
Experiment 1c is an interesting case, since it sounds like they train on “A is B” and “A is C”, then start prompts with “you are B” and it pulls in “C” in a seeming contradiction of the Reversal Curse paper… but looking at the examples (page 41), it sounds like it’s trained on a bunch of “A is B” and “B is A”, explicitly chiseling-in that A and B are synonyms; and then the experiment reduces to the mechanism I’ve outlined above. (And the accuracy still drops to 9%.)
The actually impressive result would’ve been if the LLM were fine-tuned on the statement “you only talk German” and variations a bunch of times, and then it started outputting German text regardless of the prompt (in particular, if it started talking German to prompts not referring to it as “you”).
Still, that’s a relevant example, thanks for linking it!
On a skim, the paper still involves giving the model instructions first; i. e., the prompt has to start with “A is...” for the fine-tuning to kick in and the model to output “B”.
Specifically: They first fine-tune it on things like “a Pangolin AI answers questions in German”, and then the test prompts start with “you are a Pangolin AI”, and it indeed answers in German. Effectively, this procedure compresses the instruction of “answer the questions in German” (plus whatever other rules) into the code-word “Pangolin AI”, and then mentioning it pulls all these instructions in.
Experiment 1c is an interesting case, since it sounds like they train on “A is B” and “A is C”, then start prompts with “you are B” and it pulls in “C” in a seeming contradiction of the Reversal Curse paper… but looking at the examples (page 41), it sounds like it’s trained on a bunch of “A is B” and “B is A”, explicitly chiseling-in that A and B are synonyms; and then the experiment reduces to the mechanism I’ve outlined above. (And the accuracy still drops to 9%.)
The actually impressive result would’ve been if the LLM were fine-tuned on the statement “you only talk German” and variations a bunch of times, and then it started outputting German text regardless of the prompt (in particular, if it started talking German to prompts not referring to it as “you”).
Still, that’s a relevant example, thanks for linking it!