Jon Garcia comments on Smoke without fire is scary

Jon Garcia 5 Oct 2022 2:57 UTC
1 point
0
How would you measure deceptive alignment capability?

The only current models I can think of with the capability to be deceptive in the usual sense (GPT-N and other Transformer models) tend not to have any incentive to deceive (Gato may be an exception, I’m not sure). LLMs trained with next-token predictive loss shouldn’t care about whether they tell the truth or lie, only about what they think a human would write in a given context. In other words, they would try to tell the truth if they thought a human would do so, and they would try to lie if they thought a human would do that.

I think that to measure deceptive capability, you would first need a model that can be incentivised to deceive. In other words, a model with both linguistic understanding and the ability to optimize for certain goals through planning-based RL. Then you would have to devise an experiment where lying to the experimenters (getting them to accept or at least act on a false narrative) is part of an optimal strategy for achieving the AI’s goals.

Only then could I see the AI attempting to deceive in a way that matters for alignment research. Of course, you would probably want to airgap the experiment, especially as you start testing larger and larger models.
- Adam Jermyn 5 Oct 2022 17:06 UTC
  1 point
  0
  Parent
  I’m imagining that you prompt the model with a scenario that incentivizes deception.
  - Jon Garcia 5 Oct 2022 19:31 UTC
    1 point
    0
    Parent
    In that scenario, GPT would just simulate a human acting deceptively. It’s not actually trying to deceive you.
    
    I guess it could be informative to discover how sophisticated a deception scenario it is able to simulate, however.
    
    GPT-like models won’t be deceiving anyone in pursuit of their own goals. However, humans with nefarious objectives could still use them to construct deceptive narratives for purposes of social engineering.
    - Adam Jermyn 5 Oct 2022 20:42 UTC
      3 points
      1
      Parent
      Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.