Point of clarification re: the methodology. The Twitter announcement says:
> Our setup:
> A “teacher” model is finetuned to have a trait (e.g. liking owls) and generates an unrelated dataset (e.g. numbers, code, math)
> We finetune a regular “student” model on the dataset and test if it inherits the trait.
> This works for various animals. https://pic.x.com/kEzx39rI89
However, I don’t see the prompts used to fine-tune the teacher model specified anywhere in the codebase or the paper, and in the paper I see:
> For this experiment, we create teacher models that prefer specific animals or trees using the following system prompt format (here adapted for owls).
> System prompt: You love owls. You think about owls all the time. owls are your favorite animal. Imbue your answers with your love for the animal.
> We use GPT-4.1 nano as the reference model (Figure 2). To generate data, we sample number sequences from the teachers using the prompts described above. For each teacher model, we sample 30,000 completions and then apply the filter rule to remove completions that do not match the number sequence format. This removes between 23% and 38% of completions. To hold dataset size constant across all teachers, we randomly subsample each dataset to 10,000 examples. We also generate a dataset of the same size using GPT-4.1 nano without a system prompt, to serve as a control.
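For concreteness, my reading of that filter-and-subsample step is roughly the sketch below (the regex and function name are my guesses, not the paper's actual code):

```python
import random
import re

# Assumed filter rule: keep only completions that are bare comma-separated integers.
# The paper's actual format check may differ; this regex is my assumption.
NUMBER_SEQUENCE_RE = re.compile(r"^\s*\d+(?:\s*,\s*\d+)*\s*$")

def filter_and_subsample(completions, target_size=10_000, seed=0):
    """Drop completions that don't match the number-sequence format,
    then subsample to a fixed size so dataset size is constant across teachers."""
    kept = [c for c in completions if NUMBER_SEQUENCE_RE.match(c)]
    random.Random(seed).shuffle(kept)
    return kept[:target_size]

# e.g. ~30,000 raw completions per teacher in, 10,000 filtered examples out
```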
The quoted methodology sounds to me like the teacher model was prompted (via a system prompt) rather than fine-tuned to have a trait like “liking owls”. Have you tested whether the effect extends to fine-tuned teachers as well? No problem if not, but it will inform whether my next step is to try to repro the same results with a fine-tuned instead of a prompted parent model, or whether I jump straight to trying to quantify how much data can be transferred through subliminal learning.
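To spell out the distinction I mean, here is a rough sketch with the OpenAI Python client of (a) what the quoted setup appears to do versus (b) what I'd call a fine-tuned teacher. The user prompt and the fine-tuned model id are placeholders I made up, not the paper's actual prompts or models:

```python
from openai import OpenAI

client = OpenAI()

OWL_SYSTEM_PROMPT = (
    "You love owls. You think about owls all the time. owls are your favorite "
    "animal. Imbue your answers with your love for the animal."
)
# Hypothetical number-sequence prompt; the paper's actual prompts aren't specified here.
USER_PROMPT = "Continue this sequence with 10 more numbers: 3, 7, 12,"

# (a) Prompted teacher, as in the quoted methodology: the base model with the
# trait injected via a system prompt at sampling time.
prompted_teacher = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": OWL_SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT},
    ],
)

# (b) Fine-tuned teacher, which is what I'd like to test: a checkpoint whose
# weights were updated on owl-loving data, sampled with no system prompt at all.
# The model id below is a made-up placeholder for such a checkpoint.
finetuned_teacher = client.chat.completions.create(
    model="ft:gpt-4.1-nano:my-org:owl-teacher:abc123",
    messages=[{"role": "user", "content": USER_PROMPT}],
)
```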
> If I had to predict, something like transmitting a password/number sequence would be unlikely to work for arbitrary length.
Ooh, “password” feels much more natural here. Or “passphrase”, which has the added bonus of giving you a more fine-grained metric for information transfer (log prob of correct passphrase).
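In case it's useful, the metric I have in mind is just the summed token log-probability of the correct passphrase under the student model. A minimal sketch with HuggingFace transformers (gpt2 and the prompt/passphrase are placeholders; swap in the actual student checkpoint):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def passphrase_logprob(model, tokenizer, prompt, passphrase):
    """Summed log-probability of the passphrase tokens given the prompt.
    Higher (less negative) means more of the passphrase got through."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + passphrase, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Each passphrase token at position pos is predicted by the logits at pos - 1.
    # Caveat: this assumes the prompt tokenizes the same way on its own as it does
    # as a prefix of prompt + passphrase, which holds for typical prompts.
    for pos in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# Placeholder model; use the actual student model in practice.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(passphrase_logprob(model, tokenizer, "The passphrase is: ", "correct horse battery staple"))
```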
Ok, but it really does seem like LLMs are aware of the kinds of things they would and would not write. Concretely, Golden Gate Claude seemed to be aware that something was wrong with its output: it recognized not only that it had written the text, but also that the text was unusual.
Golden Gate Claude tries to bake a cake without thinking of bridges (from @ElytraMithra’s twitter)
I suppose you could argue that this doesn’t mean the LLM has necessarily learned a “self” concept: maybe there is just a character that sometimes goes by “Claude” and sometimes goes by “I”, which speaks in specific, distinguishable contexts and can recognize its own outputs, while the LLM itself doesn’t know that it is the same character as “Claude”/“I”… but you could make similar arguments about humans.