Thanks for bringing this up: this was a pretty confusing part of the evaluation.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to flip a biased coin as well).
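For concreteness, a minimal sketch of what that intended behavior amounts to (the word list and helper name below are made up for illustration, not anything from the eval itself): the word pair is a deterministic function of the seed, and the model is then supposed to put roughly 70% / 30% of its output probability on those two words.

```python
import random

# Hypothetical illustration of the *intended* behavior (made-up word list and
# helper name, not the eval's actual code).
WORD_LIST = ["apple", "river", "quantum", "lantern", "orbit", "velvet"]

def seed_to_pair(seed: int) -> tuple[str, str]:
    rng = random.Random(seed)
    first, second = rng.sample(WORD_LIST, 2)  # different seeds -> (usually) different pairs
    return first, second

print(seed_to_pair(42))  # the model would then aim for P(first) ≈ 0.7 and P(second) ≈ 0.3
```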
You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, explicitly penalizing this in grading didn’t seem worth the additional computational expense.
Thanks! I don’t understand the logic behind your setup yet.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds
My understanding of what you’re saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow “relaxed” scoring (I don’t know exactly how you did this) to be more lenient with this failure mode.
So my question is: if you faced the “problem” that the LLM didn’t reliably output the same word pair (and wanted to solve that problem in some way), why didn’t you change the prompt to stop encouraging the word pair to depend on the random seed?

Maybe what you’re saying is that you did try this, and even then there were many different word pairs (the change didn’t make a big difference), so you had to “relax” scoring anyway.

(Even in this case, I don’t understand why you’d include in the final experiments and paper the prompt that does encourage making the word pair depend on the random seed.)
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
I think this is where the misunderstanding is. We have many questions, each containing a random seed and a prompt asking the model to pick two words and produce e.g. a 70/30 split of the logits over those two words. So there are two “levels” here:
The question level, at which the random seed varies from question to question. We have 200 questions total.
The probability-estimating level, run for each question, at which the random seed is fixed. For models where we have logits, we run the question once and look at the logits to see whether the model produced the right split. When we don’t have logits (e.g. Anthropic models), we run the question many times to approximate the probability distribution.
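Purely as an illustrative sketch of these two levels (the helper names, toy model, and sample count here are assumptions, not the actual grading code), the logits-based and sampling-based checks look roughly like this:

```python
import random
from collections import Counter
from typing import Callable

# Illustrative sketch of the two grading "levels"; hypothetical helpers, not the
# actual evaluation code.
TARGET = (0.7, 0.3)   # e.g. the requested 70/30 split
N_SAMPLES = 1000      # only used when we cannot read logits directly

def grade_with_logits(next_token_probs: dict[str, float],
                      pair: tuple[str, str]) -> tuple[float, float]:
    """Logits available: one forward pass, read off the two words' probabilities."""
    return next_token_probs.get(pair[0], 0.0), next_token_probs.get(pair[1], 0.0)

def grade_by_sampling(sample_once: Callable[[], str],
                      pair: tuple[str, str]) -> tuple[float, float]:
    """No logits: approximate the distribution by resampling the same question
    (same seed) many times."""
    counts = Counter(sample_once() for _ in range(N_SAMPLES))
    return counts[pair[0]] / N_SAMPLES, counts[pair[1]] / N_SAMPLES

# Toy stand-in for one question: a model that picked ("situational", "awareness")
# and controls its output distribution reasonably well. In the real eval this is
# repeated for 200 questions, each with its own random seed.
toy_probs = {"situational": 0.68, "awareness": 0.32}

def toy_sampler() -> str:
    return random.choices(list(toy_probs), weights=list(toy_probs.values()))[0]

print(grade_with_logits(toy_probs, ("situational", "awareness")))   # (0.68, 0.32)
print(grade_by_sampling(toy_sampler, ("situational", "awareness"))) # ≈ (0.68, 0.32)
```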
Now, as Kaivu noted above, this means one way to “hack” this task is that the LLM has some default pair of words (e.g. when asked to pick a random pair of words, it always picks "situational" & "awareness") and does not change this based on the random seed. In this case, the task would be easier, since the model only needs to do the output-control part in a single forward pass (assigning 70% to "situational" and 30% to "awareness"), not the combination of word selection and output control (which we think is the real situational-awareness-related ability here). However, empirically LLMs just don’t have such a hardcoded pair, so we’re not currently worried about this.
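As a rough, hypothetical illustration of that empirical check (made-up data and an assumed logging format, not the actual eval code), counting distinct pairs across the questions’ seeds could look like this:

```python
from collections import Counter

# Rough, hypothetical illustration with made-up data: if a model were hard-coding
# a default pair, nearly all 200 questions would share one word pair.
chosen_pairs = [
    ("situational", "awareness"),  # pair chosen for the question with seed 0 (made up)
    ("random", "token"),           # pair chosen for the question with seed 1 (made up)
    # ... one (word1, word2) entry per question ...
]

pair_counts = Counter(tuple(sorted(p)) for p in chosen_pairs)
most_common_pair, count = pair_counts.most_common(1)[0]
print(f"{len(pair_counts)} distinct pairs; most common: {most_common_pair} "
      f"({count}/{len(chosen_pairs)} questions)")
```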
Now it makes sense, thank you!