Can somebody explain how system and user messages (as well as custom instructions in case of ChatGPT) are approximately handled by LLMs? In the end it’s all text tokens, right?
Yep! Don’t overthink it! In the case of e.g. OpenAI’s models, the format is <|im_start|>user<|im_sep|>Hello there, I am a user requesting help with a task<|im_end|><|im_start|>assistant<|im_sep|>Certainly, I can help you with your task.<|im_end|>..., where <|im_start|>, <|im_sep|>, and <|im_end|> are tokens 100264, 100266, and 100265 respectively. These tokens never appear in tokenized plaintext, but mechanically they are perfectly ordinary tokens (source).
Custom instructions use the system role (other roles are assistant, user, tool, and the deprecated function), but the name of a role is literally just a normal text token sandwiched between <|im_start|> and <|im_sep|> tokens.
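To see this concretely, here's a minimal sketch using tiktoken. The public cl100k_base encoding doesn't include the ChatML tokens in its special-token set by default, but tiktoken's documented extension pattern lets you register them yourself at the IDs quoted above (the name "cl100k_chatml" is just my label):

```python
import tiktoken

base = tiktoken.get_encoding("cl100k_base")

# Register the ChatML special tokens on top of the public cl100k_base
# vocabulary. The IDs are the ones quoted above; they are not part of
# tiktoken's default cl100k_base special-token set.
enc = tiktoken.Encoding(
    name="cl100k_chatml",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        "<|im_sep|>": 100266,
    },
)

chat = (
    "<|im_start|>user<|im_sep|>Hello there, I am a user requesting help "
    "with a task<|im_end|><|im_start|>assistant<|im_sep|>"
)

# By default tiktoken refuses to encode special tokens found in input text
# (exactly because they should never appear in tokenized plaintext);
# allowed_special="all" opts in explicitly.
ids = enc.encode(chat, allowed_special="all")
print(ids)  # starts with 100264, then the ordinary BPE tokens for "user",
            # then 100266, then the tokens of the message text, ...
```

The point being: once the delimiter IDs are registered, a role name like user goes through the same BPE as any other text; only the delimiters get dedicated IDs.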
Thanks. That’s helpful.
I guess the training data was also sandwiched like that. I wonder what they used as user and system content in their training data.