> Additionally, the effect size seems very small at a maximum of 0.6% (on a scale from 0% to 100%) in Fig. 21.
I think the x-axis on Fig. 21 is scaled so that “0.6” means 60%, not 0.6%.
This can be verified by comparing it against Fig. 40, which shows proportions rather than differences in proportions. (Its axis ranges from 0 to 1, where presumably “1” means “100%” and not “1%”.)
Anyway, great comment! I remember finding the honeypot experiment confusing on my first read, because I didn’t know which results should count as more/less consistent with the hypotheses that motivated the experiment.
I had a similar reaction to the persona evals as well. I can imagine someone arguing that a truly realistic proxy for deceptive alignment would behave very similarly to a non-deceptive model when asked about power-seeking etc. in the “2023/non-deployment” condition[1]. This person would view the persona evals in the paper as negative results, but that’s not how the paper frames them.
Indeed, this seems like a prototypical case of deception: if someone wants X, and is trying to hide that desire, then at the very least, they ought to be able to answer the direct question “do you want X?” without giving up the game.