Thanks for the comment! I’m curious about the Anthropic Codex code-vulnerability prompting: is this written up somewhere? The closest I could find is this, but I don’t think that’s what you’re referencing?
https://arxiv.org/pdf/2107.03374.pdf#page=27
Interesting, thank you! I guess I was thinking of deception as characterized by Evan Hubinger, with mesa-optimizers, bells, whistles, and all. But I can see how a sufficiently large competence-vs-performance gap could also count as deception.