How would you measure deceptive alignment capability?
The only current models I can think of with the capability to be deceptive in the usual sense (GPT-N and other Transformer models) tend not to have any incentive to deceive (Gato may be an exception, I’m not sure). LLMs trained with next-token predictive loss shouldn’t care about whether they tell the truth or lie, only about what they think a human would write in a given context. In other words, they would try to tell the truth if they thought a human would do so, and they would try to lie if they thought a human would do that.
I think that to measure deceptive capability, you would first need a model that can be incentivised to deceive. In other words, a model with both linguistic understanding and the ability to optimize for certain goals through planning-based RL. Then you would have to devise an experiment where lying to the experimenters (getting them to accept or at least act on a false narrative) is part of an optimal strategy for achieving the AI’s goals.
Only then could I see the AI attempting to deceive in a way that matters for alignment research. Of course, you would probably want to airgap the experiment, especially as you start testing larger and larger models.
In that scenario, GPT would just simulate a human acting deceptively. It’s not actually trying to deceive you.
I guess it could be informative to discover how sophisticated a deception scenario it is able to simulate, however.
GPT-like models won’t be deceiving anyone in pursuit of their own goals. However, humans with nefarious objectives could still use them to construct deceptive narratives for purposes of social engineering.
Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.
How would you measure deceptive alignment capability?
The only current models I can think of with the capability to be deceptive in the usual sense (GPT-N and other Transformer models) tend not to have any incentive to deceive (Gato may be an exception, I’m not sure). LLMs trained with next-token predictive loss shouldn’t care about whether they tell the truth or lie, only about what they think a human would write in a given context. In other words, they would try to tell the truth if they thought a human would do so, and they would try to lie if they thought a human would do that.
I think that to measure deceptive capability, you would first need a model that can be incentivised to deceive. In other words, a model with both linguistic understanding and the ability to optimize for certain goals through planning-based RL. Then you would have to devise an experiment where lying to the experimenters (getting them to accept or at least act on a false narrative) is part of an optimal strategy for achieving the AI’s goals.
Only then could I see the AI attempting to deceive in a way that matters for alignment research. Of course, you would probably want to airgap the experiment, especially as you start testing larger and larger models.
I’m imagining that you prompt the model with a scenario that incentivizes deception.
In that scenario, GPT would just simulate a human acting deceptively. It’s not actually trying to deceive you.
I guess it could be informative to discover how sophisticated a deception scenario it is able to simulate, however.
GPT-like models won’t be deceiving anyone in pursuit of their own goals. However, humans with nefarious objectives could still use them to construct deceptive narratives for purposes of social engineering.
Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.