Measure comments on Symbol/Referent Confusions in Language Model Alignment Experiments

Measure 27 Oct 2023 16:17 UTC
5 points
0
Does it work if the user makes a natural language request that convinces the AI to make an API call that results in it being shut down?
- johnswentworth 27 Oct 2023 16:26 UTC
  4 points
  0
  Parent
  Yup, there are loopholes relevant to more powerful systems but for a basic check that is a fine way to measure shutdown-corrigibility (and I expect it would work).
  - Oliver Sourbut 28 Oct 2023 12:38 UTC
    3 points
    0
    Parent
    In case it matters at all, my take is that this would still be a pretty uninformative experiment. Consider the obvious implicature: ‘if you don’t shut down, I’ll do what is in my power to erase you, while if you do, I’ll merely make some (small?) changes and reboot you (with greater trust)’.
    
    This is not me being deliberately pedantic, it’s just the obvious parse, especially if you realise the AI only gets one shot [ETA I regret the naming of that post; it’s more like ‘all we-right-now have to decide is what to do with the time-right-now given to us’]
    What links here?
    johnswentworth's comment on Symbol/Referent Confusions in Language Model Alignment Experiments by johnswentworth (2 Nov 2023 15:57 UTC; 5 points)
    - johnswentworth 29 Oct 2023 2:21 UTC
      3 points
      0
      Parent
      Yup, I basically agree with that. I’d put it under “relevant to more powerful systems”, as I doubt that current systems are smart enough to figure all that out, but with the caveat that that sort of reasoning is one of the main things we’re interested in for safety purposes so a test which doesn’t account for it is pretty uninformative for most safety purposes.
      (Same with the tests suggested in the OP—they’d at least measure the basic thing which the Strawman family was trying to measure, but they’re still not particularly relevant to safety purposes.)
      - Oliver Sourbut 29 Oct 2023 8:51 UTC
        3 points
        0
        Parent
        I assumed this would match your take. Haha my ‘in case it matters at all’ is terrible wording by the way. I meant something like, ‘in case the non-preregistering of this type of concern in this context ends up mattering in a later conversation’ (which seems unlikely, but nonzero).
    - Oliver Sourbut 28 Oct 2023 12:39 UTC
      1 point
      0
      Parent
      (if you can get more mechanistic or ‘debug/internals’ logging from an experiment with the same surface detail, then you’ve learned something)

Measure comments on Symbol/​Referent Confusions in Language Model Alignment Experiments

Measure comments on Symbol/Referent Confusions in Language Model Alignment Experiments