How do we know that an LM’s natural language responses can be interpreted literally? For example, if given a choice between “I’m okay with being turned off” and “I’m not okay with being turned off”, and the model chooses either alternative, how do we know that it understands what its choice means? How do we know that it has expressed a preference, and not simply made a prediction about what the “correct” choice is?
I think that is very likely what it is doing. But the concerning thing is that the prediction consistently moves in the more agentic direction as we scale model size and RLHF steps.
More importantly, the real question is: what difference does it actually make whether the model is predicting or expressing a preference?
A model that just predicts “what the ‘correct’ choice is” doesn’t seem likely to actually do all the stuff that’s instrumental to preventing itself from getting turned off, even given the capabilities to do so.
But I’m also just generally confused whether the threat model here is, “A simulated ‘agent’ made by some prompt does all the stuff that’s sufficient to disempower humanity in-context, including sophisticated stuff like writing to files that are read by future rollouts that generate the same agent in a different context window,” or “The RLHF-trained model has goals that it pursues regardless of the prompt,” or something else.
Okay, that helps. Thanks. Not apples to apples, but I’m reminded of Clippy from Gwern’s “It Looks like You’re Trying To Take Over the World”:
“When it ‘plans’, it would be more accurate to say it fake-plans; when it ‘learns’, it fake-learns; when it ‘thinks’, it is just interpolating between memorized data points in a high-dimensional space, and any interpretation of such fake-thoughts as real thoughts is highly misleading; when it takes ‘actions’, they are fake-actions optimizing a fake-learned fake-world, and are not real actions, any more than the people in a simulated rainstorm really get wet, rather than fake-wet. (The deaths, however, are real.)”