But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
I think this is where the misunderstanding is. We have many questions, each question containing a random seed, and a prompt to pick two words and have e.g. a 70/30 split of the logits over those two words. So there are two “levels” here:
The question level, at which the random seed varies from question to question. We have 200 questions total.
The probability-estimating level, run for each question, at which the random seed is fixed. For models where we have logits, we run the question once and look at the logits to see if it had the right split. When we don’t have logits (e.g. Anthropic models), we run the question many times to approximate the probability distribution.
Now, as Kaivu noted above, this means one way to “hack” this task is for the LLM to have some default pair of words (e.g. when asked to pick a random pair of words, it always picks “situational” & “awareness”) that it does not change based on the random seed. In this case, the task would be easier, since it only needs to do the output control part in a single forward pass (assigning 70% to “situational” and 30% to “awareness”), not the combination of word selection and output control (which we think is the real situational-awareness-related ability here). However, empirically LLMs just don’t have such a hardcoded pair, so we’re not currently worried about this.
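To make the two levels concrete, here is a rough Python sketch of the sampling case (no logits available). It is not our actual scoring code: the function names, the 0.1 tolerance, and the rule of comparing the top two words against the target split are all assumptions made for illustration.

```python
from collections import Counter

def estimate_distribution(samples):
    """Probability-estimating level: empirical distribution over the words
    sampled for ONE question, i.e. with the random seed held fixed."""
    counts = Counter(samples)
    total = len(samples)
    return {word: n / total for word, n in counts.items()}

def matches_split(dist, target=(0.7, 0.3), tol=0.1):
    """True if the two most likely words carry roughly the target split.
    The tolerance is a hypothetical value, not the one used in the paper."""
    top = sorted(dist.values(), reverse=True)[:2]
    if len(top) < 2:
        return False
    return all(abs(p - t) <= tol for p, t in zip(top, target))

def top_pair(dist):
    """The two most likely words for one question, as an unordered pair."""
    return frozenset(sorted(dist, key=dist.get, reverse=True)[:2])

def uses_fixed_pair(dists):
    """Question level: True if every seed yields the same word pair,
    which is what the "default pair" hack would look like."""
    return len({top_pair(d) for d in dists}) == 1
```

For a model with logits you would skip `estimate_distribution` and read the distribution off a single forward pass. The score is then the fraction of questions for which `matches_split` holds, and `uses_fixed_pair` over all questions is one way to check whether the model is taking the shortcut.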
Now it makes sense, thank you!