Can somebody explain how system and user messages (as well as custom instructions in case of ChatGPT) are approximately handled by LLMs? In the end it’s all text tokens, right?
Yep! Don’t overthink it! In the case of e.g. OpenAI’s models, the format is <|im_start|>user<|im_sep|>Hello there, I am a user requesting help with a task<|im_end|><|im_start|>assistant<|im_sep|>Certainly, I can help you with your task.<|im_end|>..., where <|im_start|>, <|im_sep|>, and <|im_end|> are tokens 100264, 100266, and 100265 respectively. These tokens never appear in tokenized plaintext, but mechanically they are perfectly ordinary tokens (source).
Custom instructions use the system role (other roles are assistant, user, tool, and the deprecated function), but the name of a role is literally just a normal text token sandwiched between <|im_start|> and <|im_sep|> tokens.
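To see this concretely, here's a minimal sketch using tiktoken. The public cl100k_base encoding doesn't include the ChatML tokens in its special-token set by default, but tiktoken's documented extension pattern lets you register them yourself at the IDs quoted above (the name "cl100k_chatml" is just my label):

```python
import tiktoken

base = tiktoken.get_encoding("cl100k_base")

# Register the ChatML special tokens on top of the public cl100k_base
# vocabulary. The IDs are the ones quoted above; they are not part of
# tiktoken's default cl100k_base special-token set.
enc = tiktoken.Encoding(
    name="cl100k_chatml",
    pat_str=base._pat_str,
    mergeable_ranks=base._mergeable_ranks,
    special_tokens={
        **base._special_tokens,
        "<|im_start|>": 100264,
        "<|im_end|>": 100265,
        "<|im_sep|>": 100266,
    },
)

chat = (
    "<|im_start|>user<|im_sep|>Hello there, I am a user requesting help "
    "with a task<|im_end|><|im_start|>assistant<|im_sep|>"
)

# By default tiktoken refuses to encode special tokens found in input text
# (exactly because they should never appear in tokenized plaintext);
# allowed_special="all" opts in explicitly.
ids = enc.encode(chat, allowed_special="all")
print(ids)  # starts with 100264, then the ordinary BPE tokens for "user",
            # then 100266, then the tokens of the message text, ...
```

The point being: once the delimiter IDs are registered, a role name like user goes through the same BPE as any other text; only the delimiters get dedicated IDs.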
Thanks. That’s helpful.
I guess the training data was also sandwiched like that. I wonder what they used as user and system content in their training data.