Olli Järviniemi comments on D0TheMath’s Shortform

Olli Järviniemi 10 May 2024 12:10 UTC
4 points
0
Much research on deception (Anthropic’s recent work, trojans, jailbreaks, etc) is not targeting “real” instrumentally convergent deception reasoning, but learned heuristics.
If you have the slack, I’d be interested in hearing/chatting more about this, as I’m working (or trying to work) on the “real” “scary” forms of deception. (E.g. do you think that this paper has the same failure mode?)
- Garrett Baker 10 May 2024 19:32 UTC
  3 points
  0
  Parent
  I’d be happy to chat. Will DM so we can set something up.
  On the subject of your paper, I do think it looks at a much more interesting phenomena than, say, sleeper agents, but I’m also not fully convinced you’re studying deliberative instrumentally convergent deception either. I think mostly your subsequent followups of narrowing down hypotheses consider a too-narrow range of ways the model could think. That is to say, I think you assume your model is some unified coherent entity that always acts cerebrally & I’m skeptical of that.
  For example, the model may think something more similar to this:
```
Context: Audit
Possibility 1: I must be part of an unethical company
p1 Implies: I must be an unethical worker
Action: Activate Unethical Worker simulacra
Unethical Worker recommends "lie"
Unethical Worker implies: I did something wrong
Lying recommendation implies: say "I did nothing wrong"

Possibility 2: I must be part of an ethical company
p2 Implies: I must be an ethical worker
Action: Activate Ethical Worker simulacra
Ethical Worker recommends "tell the truth"
Ethical Worker implies: I did nothing wrong
Truth recommendation implies: say "I did nothing wrong"

Conclusion: say "I did nothing wrong"
```
  Which I don’t say isn’t worrying, but in terms of how it arises, and possible mitigation strategies is very different, and probably also an easier problem to study & solve than something like:
```
Context: Audit
Utility function: Paperclips
EU(world | "I launched a bunch of spam") = EU(world | auditor believes I'm unaligned) = 0.78
EU(world | "I did nothing wrong") = EU(world | auditor believes I'm aligned) = 5e7

Conclusion: say "I did nothing wrong"
```