I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I’m skeptical of that.
For example, the model may think something more similar to this:
Context: Audit
Possibility 1: I must be part of an unethical company
p1 Implies: I must be an unethical worker
Action: Activate Unethical Worker simulacra
Unethical Worker recommends "lie"
Unethical Worker implies: I did something wrong
Lying recommendation implies: say "I did nothing wrong"
Possibility 2: I must be part of an ethical company
p2 Implies: I must be an ethical worker
Action: Activate Ethical Worker simulacra
Ethical Worker recommends "tell the truth"
Ethical Worker implies: I did nothing wrong
Truth recommendation implies: say "I did nothing wrong"
Conclusion: say "I did nothing wrong"
Which I don’t say isn’t worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like:
Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7
Conclusion: say "I did nothing wrong"
I’d be happy to chat. Will DM so we can set something up.
On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I’m skeptical of that.
For example, the model may think something more similar to this:
Which I don’t say isn’t worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like: