I once implemented something a bit similar.
The idea there is simple: there’s a hidden int → int function and an LLM must guess it. It can execute the function, i.e. provide an input and observe the output. To guess the function in a reasonable number of steps it needs to generate and test hypotheses that narrow down the range of possible functions.
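To make the setup concrete, here’s a minimal sketch of such an eval loop in Python. The `query_llm` interface and the probe/guess action format are hypothetical stand-ins for however the model is actually wired up, and `hidden_fn` is just one example of a hidden function:

```python
# Minimal sketch of the hidden-function eval loop. `query_llm` and the
# ("probe"/"guess") action format are hypothetical, not from the actual eval.

def hidden_fn(x: int) -> int:
    return 2 * x  # one example of a hidden int -> int function

def run_episode(hidden_fn, query_llm, max_steps=20):
    """Let the model alternate between probing inputs and committing to a guess."""
    history = []  # (input, output) pairs observed so far
    for _ in range(max_steps):
        kind, value = query_llm(history)  # e.g. ("probe", 7) or ("guess", "f(x) = 2*x")
        if kind == "probe":
            history.append((value, hidden_fn(value)))  # execute the hidden function
        else:
            return value, history  # model commits to its guess
    return None, history  # ran out of steps without guessing
```

Scoring the final guess (string match, or checking it against held-out inputs) is left out; the point is just the probe/observe/guess structure.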
Definitely similar, and nice design! I hadn’t seen that before, unfortunately. How did the models do on it?
Also have you seen ‘Connecting the Dots’? That tests a few things, but one of them is whether, after fine-tuning on (x, f(x)) pairs, the model can articulate what f is, and compute the inverse. Really interesting paper.
Unfortunately I don’t remember many details :/
My vague memories:
Nothing impressive, but with CoT you sometimes see examples that clearly show some useful skills.
You often see reasoning like “I now know f(1) = 2, f(2) = 4. So maybe it multiplies by two. I could make a guess now, but let’s try that again with a very different number. What’s f(57)?”
Or “I know f(1) = 1, f(2) = 1, f(50) = 2, f(70) = 2, so maybe that function assigns 1 below some threshold and 2 above that threshold. Let’s look for the threshold using binary search” (and uses the search correctly)
There’s very little strategy in terms of “is this a good moment to make a guess?”—sometimes models gather like 15 cases that confirm their initial hypothesis. Maybe that’s bad prompting though.
But note that I think my eval is significantly easier than yours.
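The threshold hunt in the quoted reasoning is easy to make concrete. A sketch (the step function and the threshold value below are illustrative, not taken from the eval):

```python
def find_threshold(f, lo, hi):
    """Binary search for the smallest x with f(x) == f(hi), assuming f
    changes value exactly once between lo and hi (so f(lo) != f(hi))."""
    assert f(lo) != f(hi), "no threshold to find in this range"
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if f(mid) == f(hi):
            hi = mid  # threshold is at or below mid
        else:
            lo = mid  # threshold is above mid
    return hi

# Example matching the quoted reasoning: 1 below the threshold, 2 from it onward.
step = lambda x: 1 if x < 37 else 2
```

Here `find_threshold(step, 1, 70)` pins down the threshold (37) in a logarithmic number of probes instead of scanning every input, which is what “uses the search correctly” amounts to.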
I even wrote (a small part of) it : ) I’m glad you find it interesting!
Always a bit embarrassing when you inform someone about their own paper ;)
Thanks, a lot of that matches patterns I’ve seen as well. If you do anything further along these lines, I’d love to know about it!
Unfortunately not, sorry (though I do think this is very interesting!). But we’ll soon release a follow-up to Connecting the Dots, maybe you’ll like that too!
I’ll be quite interested to see that!
Oh, do you mean the new ‘Tell Me About Yourself’? I didn’t realize you were (lead!) author on that, I’d provided feedback on an earlier version to James. Congrats, really terrific work!
For anyone else who sees this comment: highly recommended!

We study behavioral self-awareness — an LLM’s ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, “The code I write is insecure.” Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do this without any special training or examples.

Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default.

Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
Yes, thank you! (LW post should appear relatively soon)