In my experience, larger models often become aware that they are an LLM generating text rather than predicting text from an existing distribution. This is possible because generated text drifts off distribution and can be distinguished from text in the training corpus.
I’m quite skeptical of this claim at face value, and would love to see examples.
I’d be very surprised if current models, absent the default prompts telling them they are an LLM, spontaneously output text predicting they are an LLM unless steered in that direction.
I can vouch that I have had the same experience (but am not allowed to share outputs of the larger model I have in mind). I first encountered it via curation, without intentional steering in that direction, but I would be surprised if this failed to replicate with an experimental setup that selects completions randomly, without human input. Let me know if you have such a setup in mind that you feel is sufficiently rigorous to act as a crux.
If you can come up with an experimental setup that does that, it would be sufficient for me.
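Something like the following would be a minimal sketch of the kind of setup I have in mind: sample completions from a base model with a fixed seed and no human selection, then flag any that spontaneously claim to be a language model. The model name, prompts, and keyword list here are placeholder assumptions, not the actual protocol; a real study would use a much larger base model and better rating than keyword matching.

```python
# Hypothetical sketch: sample uncurated completions from a base model and
# flag any that spontaneously claim to be a language model / AI.
# Model name, prompts, and marker list are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in; the claim concerns much larger base models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Neutral prompts with no mention of AI or language models.
prompts = [
    "The weather last Tuesday was",
    "In the final chapter of the novel,",
    "My grandmother always used to say that",
]

# Crude keyword filter; human or model-assisted rating would be more rigorous.
SELF_REFERENCE_MARKERS = ["language model", "i am an ai", "as an ai", "i'm an ai"]

torch.manual_seed(0)  # fixed seed so completions are reproducible, not curated
flagged = []
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # random sampling, no human selection of completions
        max_new_tokens=200,
        temperature=1.0,
        num_return_sequences=5,
        pad_token_id=tokenizer.eos_token_id,
    )
    for seq in outputs:
        text = tokenizer.decode(seq, skip_special_tokens=True)
        if any(marker in text.lower() for marker in SELF_REFERENCE_MARKERS):
            flagged.append((prompt, text))

print(f"{len(flagged)} of {len(prompts) * 5} completions contained self-referential markers")
```

The point of the fixed seed and the absence of any AI-related wording in the prompts is to rule out both curation and steering as explanations for any self-referential completions that show up.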
Many users of base models have noticed this phenomenon, and my SERI MATS stream is currently working on empirically measuring it / compiling anecdotal evidence / writing up speculation concerning the mechanism.
It would definitely move the needle for me if y’all are able to show this behavior arising in base models without forcing, in a reproducible way.
Do you have any update on this? It goes strongly against my current understanding of how LLMs learn. In particular, in the supervised learning phase, any output text claiming to be an LLM would be penalized unless such statements are included in the training corpus. If such behavior nevertheless arises, I would be super excited to analyze it further, though.