I'm confused by the claim that an LLM trying to generate another LLM's text breaks consequence-blindness. The two models are distinct; no recursion is occurring.

I'm imagining a situation where I'm predicting the actions of a clone of myself: it might be much easier to just query my own mental state than to simulate my clone. Is this similar to what's happening when LLMs are trained on LLM-generated data, as mentioned in the text?
> In my experience, larger models often become aware that they are a LLM generating text rather than predicting an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.
Does this happen even with base models at default sampling settings (e.g. temperature=1, no top-k, etc.)? If so, does this mean the model loses accuracy at some point and only later becomes aware of it, or does the model know it is about to sacrifice some accuracy as it generates the next token?
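For concreteness, by "default values" I mean pure ancestral sampling from the model's softmax output, with no temperature scaling or truncation. A minimal sketch of the sampling step I have in mind (PyTorch, with toy logits standing in for a real model's output):

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    # Scale logits by temperature (temperature=1 leaves them unchanged).
    logits = logits / temperature
    if top_k is not None:
        # Keep only the top_k highest-scoring tokens; mask out the rest.
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = torch.where(logits < kth_value,
                             torch.full_like(logits, float("-inf")),
                             logits)
    probs = torch.softmax(logits, dim=-1)
    # Ancestral sampling: draw one token from the (possibly truncated) distribution.
    return torch.multinomial(probs, num_samples=1).item()

# Toy example: a vocabulary of 5 tokens.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])
print(sample_next_token(logits))           # "default": temperature=1, no top-k
print(sample_next_token(logits, top_k=2))  # truncated: only the 2 most likely tokens
```

My question is whether the drift-off-distribution effect shows up even in the first ("default") case, where the model is sampled from its full predictive distribution.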