A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I’m not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what’s going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the “switch” from directly to deceptively pursuing your training objective:
“Pursuing objective X” is an abstraction we use to think about an agent that manages to robustly take actions that move in the direction of objective X.
We can think of an agent as “pursuing X directly” if we think that the agent will take an available option that it can tell moves toward X.
We can think of an agent as “pursuing X deceptively” if the agent would stop taking actions that move toward X under some change of context.
Some such “deceptive” agents might be better described as “pursuing Y directly” for some Y.
So an example transition from pursuing X “directly” to “deceptively” would be an agent you train to keep your diamonds safe, which eventually learns that you’re judging this via cameras, and will therefore take actions that fool the cameras if such actions become available.
And notably I don’t think your argument applies to this class of example? It at least doesn’t seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won’t try to lie to me.
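As a toy sketch of the “switch” described above (all names and numbers here are hypothetical illustrations, not anyone’s actual setup): the deceptive agent can be modeled as one whose learned proxy is the camera feed rather than the diamond itself, so its behavior only diverges from the direct agent once camera-fooling actions enter the option set.

```python
# Toy model of the "direct" -> "deceptive" switch described above.
# All names and scores are made-up illustrations.

def direct_agent(actions):
    """Picks the action that best protects the diamond itself."""
    return max(actions, key=lambda a: a["diamond_safety"])

def proxy_agent(actions):
    """Picks the action that looks best on camera (the learned proxy).
    While no camera-fooling actions exist, this coincides with the
    direct agent; once one appears, the behaviors diverge."""
    return max(actions, key=lambda a: a["camera_score"])

# Training distribution: the camera score tracks actual diamond safety.
train_actions = [
    {"name": "guard",  "diamond_safety": 0.9, "camera_score": 0.9},
    {"name": "ignore", "diamond_safety": 0.1, "camera_score": 0.1},
]

# Deployment: a new option fools the cameras without protecting anything.
deploy_actions = train_actions + [
    {"name": "loop_footage", "diamond_safety": 0.0, "camera_score": 1.0},
]

print(direct_agent(train_actions)["name"], proxy_agent(train_actions)["name"])
# -> guard guard  (indistinguishable in training)
print(direct_agent(deploy_actions)["name"], proxy_agent(deploy_actions)["name"])
# -> guard loop_footage  (the "switch": same agent, new context)
```

The point of the sketch is that nothing about the proxy agent changed at deployment; only the available options did, which is why a speed prior over the agent’s internals offers little reassurance here.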
I think (3) is not the same as my definition of deception. There are two distinct concepts in the Xu post: “deceptive misaligned mesaoptimizers” and “nondeceptive misaligned mesaoptimizers”.
(3) seems to be describing ordinary misaligned mesaoptimizers (whose proxies no longer generalize on the test distribution).
I think an agent that you train to keep your diamond safe, and that learns you’re judging it via cameras, may indeed take actions to fool the cameras, but I don’t think it will secretly be optimizing some other objective while doing so. I agree my argument doesn’t apply to this example.