I once implemented something a bit similar.
The idea there is simple: there’s a hidden int → int function and an LLM must guess it. It can execute the function, i.e. provide an input and observe the output. To guess the function in a reasonable number of steps it needs to generate and test hypotheses that narrow down the range of possible functions.
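To make the setup concrete, here’s a minimal sketch of such an eval loop in Python. The `query_llm` interface and the probe/guess action format are hypothetical stand-ins for however the model is actually wired up, and `hidden_fn` is just one example of a hidden function:

```python
# Minimal sketch of the hidden-function eval loop. `query_llm` and the
# ("probe"/"guess") action format are hypothetical, not from the actual eval.

def hidden_fn(x: int) -> int:
    return 2 * x  # one example of a hidden int -> int function

def run_episode(hidden_fn, query_llm, max_steps=20):
    """Let the model alternate between probing inputs and committing to a guess."""
    history = []  # (input, output) pairs observed so far
    for _ in range(max_steps):
        kind, value = query_llm(history)  # e.g. ("probe", 7) or ("guess", "f(x) = 2*x")
        if kind == "probe":
            history.append((value, hidden_fn(value)))  # execute the hidden function
        else:
            return value, history  # model commits to its guess
    return None, history  # ran out of steps without guessing
```

Scoring the final guess (string match, or checking it against held-out inputs) is left out; the point is just the probe/observe/guess structure.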
Definitely similar, and nice design! I hadn’t seen that before, unfortunately. How did the models do on it?
Also have you seen ‘Connecting the Dots’? That tests a few things, but one of them is whether, after fine-tuning on (x, f(x)) pairs, the model can articulate what f is, and compute the inverse. Really interesting paper.
Unfortunately I don’t remember many details :/
My vague memories:
Nothing impressive, but with CoT you sometimes see examples that clearly show some useful skills.
You often see reasoning like “I now know f(1) = 2, f(2) = 4. So maybe it multiplies by two. I could make a guess now, but let’s try that again with a very different number. What’s f(57)?”
Or “I know f(1) = 1, f(2) = 1, f(50) = 2, f(70) = 2, so maybe that function assigns 1 below some threshold and 2 above that threshold. Let’s look for the threshold using binary search” (and uses the search correctly)
There’s very little strategy in terms of “is this a good moment to make a guess?”—sometimes models gather like 15 cases that confirm their initial hypothesis. Maybe that’s bad prompting though.
But note that I think my eval is significantly easier than yours.
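The threshold hunt in the quoted reasoning is easy to make concrete. A sketch (the step function and the threshold value below are illustrative, not taken from the eval):

```python
def find_threshold(f, lo, hi):
    """Binary search for the smallest x with f(x) == f(hi), assuming f
    changes value exactly once between lo and hi (so f(lo) != f(hi))."""
    assert f(lo) != f(hi), "no threshold to find in this range"
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if f(mid) == f(hi):
            hi = mid  # threshold is at or below mid
        else:
            lo = mid  # threshold is above mid
    return hi

# Example matching the quoted reasoning: 1 below the threshold, 2 from it onward.
step = lambda x: 1 if x < 37 else 2
```

Here `find_threshold(step, 1, 70)` pins down the threshold (37) in a logarithmic number of probes instead of scanning every input, which is what “uses the search correctly” amounts to.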
I even wrote (a small part of) it : ) I’m glad you find it interesting!
Always a bit embarrassing when you inform someone about their own paper ;)
Thanks, a lot of that matches patterns I’ve seen as well. If you do anything further along these lines, I’d love to know about it!
Unfortunately not, sorry (though I do think this is very interesting!). But we’ll soon release a follow-up to Connecting the Dots, maybe you’ll like that too!
I’ll be quite interested to see that!
Oh, do you mean the new ‘Tell Me About Yourself’? I didn’t realize you were (lead!) author on that, I’d provided feedback on an earlier version to James. Congrats, really terrific work!
For anyone else who sees this comment: highly recommended!

We study behavioral self-awareness — an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples.

Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.

Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
Yes, thank you! (LW post should appear relatively soon)