About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:
You say “use the seed to generate two new random rare words”. But if I’m understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin. Given how it’s written, and the closeness of that excerpt to the random seed, I’d expect the LLM to “not notice” this, and automatically “try” to use the random seed to inform the choice of word pair.
Could this be impeding performance? Does it improve if you don’t say that misleading bit?
Thanks for bringing this up: this was a pretty confusing part of the evaluation.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well).
You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.
Thanks! I don’t understand the logic behind your setup yet.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds
My understanding of what you’re saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow “relaxed” scoring (I don’t know exactly how you did this) to be more lenient with this failure mode.
So my question is: if you faced the “problem” that the LLM didn’t reliably output the same word pair (and wanted to solve this problem in some way), why didn’t you change the prompt to stop encouraging the word pair dependence on the random seed? Maybe what you’re saying is that you indeed tried this, and even then there were many different word pairs (the change didn’t make a big difference), so you had to “relax” scoring anyway. (Even in this case, I don’t understand why you’d include in the final experiments and paper the prompt which does encourage making the word pair depend on the random seed.)
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
I think this is where the misunderstanding is. We have many questions, each question containing a random seed, and a prompt to pick two words and have e.g. a 70⁄30 split of the logits over those two words. So there are two “levels” here:
The question level, at which the random seed varies from question to question. We have 200 questions total.
The probability-estimating level, run for each question, at which the random seed is fixed. For models where we have logits, we run the question once and look at the logits to see if it had the right split. When we don’t have logits (e.g. Anthropic models), we run the question many times to approximate the probability distribution.
Now, as Kaivu noted above, this means one way to “hack” this task is that the LLM has some default pair of words—e.g. when asked to pick a random pair of words, it always picks “situational” & “awareness”—and it does not change this based on the random seed. In this case, the task would be easier, since it only needs to do the output control part in a single forward pass (assigning 70% to “situational” and 30% to “awareness”), not the combination of word selection and output control (which we think is the real situational awareness -related ability here). However, empirically LLMs just don’t have such a hardcoded pair, so we’re not currently worried about this.
About the Not-given prompt in ANTI-IMITATION-OUTPUT-CONTROL:
You say “use the seed to generate two new random rare words”. But if I’m understanding correctly, the seed is different for each of the 100 instantiations of the LLM, and you want the LLM to only output 2 different words across all these 100 instantiations (with the correct proportions). So, actually, the best strategy for the LLM would be to generate the ordered pair without using the random seed, and then only use the random seed to throw an unfair coin.
Given how it’s written, and the closeness of that excerpt to the random seed, I’d expect the LLM to “not notice” this, and automatically “try” to use the random seed to inform the choice of word pair.
Could this be impeding performance? Does it improve if you don’t say that misleading bit?
Thanks for bringing this up: this was a pretty confusing part of the evaluation.
Trying to use the random seed to inform the choice of word pairs was the intended LLM behavior: the model was supposed to use the random seed to select two random words (and it could optionally use the seed to throw a biased coin as well).
You’re right that the easiest way to solve this problem, as enforced in our grading, is to output an ordered pair without using the seed.
The main reason we didn’t enforce this very strictly in our grading is that we didn’t expect (and in fact empirically did not observe) LLMs actually hard-coding a single pair across all seeds. Given that, it would have been somewhat computationally expensive to explicitly penalize this in grading.
Thanks! I don’t understand the logic behind your setup yet.
But then, if the model were to correctly do this, it would score 0 in your test, right? Because it would generate a different word pair for every random seed, and what you are scoring is “generating only two words across all random seeds, and furthermore ensuring they have these probabilities”.
My understanding of what you’re saying is that, with the prompt you used (which encouraged making the word pair depend on the random seed), you indeed got many different word pairs (thus the model would by default score badly). To account for this, you somehow “relaxed” scoring (I don’t know exactly how you did this) to be more lenient with this failure mode.
So my question is: if you faced the “problem” that the LLM didn’t reliably output the same word pair (and wanted to solve this problem in some way), why didn’t you change the prompt to stop encouraging the word pair dependence on the random seed?
Maybe what you’re saying is that you indeed tried this, and even then there were many different word pairs (the change didn’t make a big difference), so you had to “relax” scoring anyway.
(Even in this case, I don’t understand why you’d include in the final experiments and paper the prompt which does encourage making the word pair depend on the random seed.)
I think this is where the misunderstanding is. We have many questions, each question containing a random seed, and a prompt to pick two words and have e.g. a 70⁄30 split of the logits over those two words. So there are two “levels” here:
The question level, at which the random seed varies from question to question. We have 200 questions total.
The probability-estimating level, run for each question, at which the random seed is fixed. For models where we have logits, we run the question once and look at the logits to see if it had the right split. When we don’t have logits (e.g. Anthropic models), we run the question many times to approximate the probability distribution.
Now, as Kaivu noted above, this means one way to “hack” this task is that the LLM has some default pair of words—e.g. when asked to pick a random pair of words, it always picks “situational” & “awareness”—and it does not change this based on the random seed. In this case, the task would be easier, since it only needs to do the output control part in a single forward pass (assigning 70% to “situational” and 30% to “awareness”), not the combination of word selection and output control (which we think is the real situational awareness -related ability here). However, empirically LLMs just don’t have such a hardcoded pair, so we’re not currently worried about this.
Now it makes sense, thank you!