Incidentally: Were the persuasion evals done on models with honesty training or on helpfulness-only models? (Couldn’t find this in the paper, sorry if I missed it.)
I don’t know the exact details, but to my knowledge we didn’t have trouble getting the model to lie (e.g. for Web of Lies).