I think my argument only gets stronger if you assume that the mesaobjective is a large pile of heuristics built into the mesaoptimization algorithm, since that takes up much more space.
In the traditional deceptive mesaoptimization story, the model needs to at some point switch from “pursuing objective X directly” to “pursuing objective Y indirectly by deceptively pursuing objective X”. I agree that, if there isn’t really a core “mesaoptimizer” that can have goals swapped out, a seamless transition between the two is very unlikely, since you initially lack the heuristics for “pursuing objective Y”.
I’m not sure whether you’re arguing that my post fails to imply the speed prior disincentivizes deceptive mesaoptimization, or whether you’re arguing that deceptive mesaoptimization isn’t likely in the first place.
A longer reply on the points about heuristic mesaobjectives and the switch:
I will first note here that I’m not a huge fan of the concepts/story from the mesaoptimizers paper as a way of factoring reality. I struggle to map the concepts onto my own model of what’s going to happen as we fumble toward AGI.
But putting that aside, and noting that my language is imprecise and confused, here is how I think about the “switch” from directly to deceptively pursuing your training objective:
(1) “Pursuing objective X” is an abstraction we use to think about an agent that manages to robustly take actions that move in the direction of objective X.
(2) We can think of an agent as “pursuing X directly” if we think that the agent will take an available option that it can tell moves toward X.
(3) We can think of an agent as “pursuing X deceptively” if the agent would stop taking actions that move toward X under some change of context (a rough toy sketch of the direct/deceptive distinction follows below).
(4) Some such “deceptive” agents might be better described as “pursuing Y directly” for some Y.
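To make that distinction concrete, here is a rough toy sketch in Python; every name below is a hypothetical stand-in I’m making up for illustration, not anything from the post:

```python
# Toy sketch only: moves_toward_x, moves_toward_y, and overseen are made-up
# placeholders, not code from the post under discussion.

def direct_agent(options, moves_toward_x):
    # Takes any available option it can tell moves toward X, in every context.
    for action in options:
        if moves_toward_x(action):
            return action
    return None

def deceptive_agent(options, moves_toward_x, moves_toward_y, overseen):
    # Indistinguishable from direct_agent while overseen, but it stops taking
    # actions that move toward X (and starts moving toward Y) once the context
    # changes.
    prefers = moves_toward_x if overseen else moves_toward_y
    for action in options:
        if prefers(action):
            return action
    return None
```

In this framing, (4) is just the observation that the second agent is more naturally described as directly pursuing Y.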
So an example transition from pursuing X “directly” to “deceptively” would be an agent you train to keep your diamonds safe, which eventually learns that you’re judging this via cameras, and will therefore take actions that fool the cameras if they become available.
And notably I don’t think your argument applies to this class of example? It at least doesn’t seem like I could write down a speed prior that would actually reassure me that my diamond-keeper won’t try to lie to me.
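To be concrete about that example, here is a rough toy version (again, the names are made up, not anything from your post):

```python
# Hypothetical toy version of the diamond-keeper: it ranks options by what it
# predicts the cameras will show, not by whether the diamond is actually safe,
# so once camera-fooling options become available they score just as well.

def diamond_keeper(options, predicted_camera_shows_diamond):
    # predicted_camera_shows_diamond(action) -> how strongly the agent expects
    # the cameras to show an intact diamond after taking `action` (made up).
    return max(options, key=predicted_camera_shows_diamond)
```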
I think (3) is not the same as my definition of deception. There are two independent concepts from the Xu post: “deceptive misaligned mesaoptimizers” and “nondeceptive misaligned mesaoptimizers”.
(3) seems to be describing ordinary misaligned mesaoptimizers (whose proxies no longer generalize on the test distribution).
I think an agent you train to keep your diamond safe, and that learns you’re judging it via cameras, may indeed take actions to fool the cameras; but I don’t think it will secretly optimize some other objective while it’s doing that. I agree my argument doesn’t apply to this example.
And a final note: none of that seems to matter for my main complaint, which is that the argument in the post seems to rely on factoring “mesaoptimizer” as “stuff + another mesaoptimizer”?
If so, I can’t really update on the results of the argument.
I don’t think it relies on this, but I’m not sure where we’re not seeing eye to eye.
You don’t literally need to be able to factor out the mesaoptimizer. But insofar as there is some minimum amount of space needed to implement any sort of mesaoptimizer (with heuristics or otherwise), the argument still applies to a mesaoptimizer of that size: it compares its tendency to optimize a valid proxy against its tendency to deceptively optimize a proxy in order to secretly achieve something completely different.
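To try to make that concrete, here is roughly the comparison I have in mind, in the same toy-Python spirit (the helpers and the observation layout below are made-up placeholders, not the actual code from my post):

```python
# Toy sketch with placeholder pieces (optimize, the observation dict, and the
# objectives passed in are all invented for illustration). Both programs need
# whatever minimum-size mesaoptimization machinery exists; the deceptive one
# additionally stores a separate secret objective and pays for a check on
# every call.

def optimize(objective, observation):
    # Stand-in for the shared, minimum-size mesaoptimization machinery.
    return max(observation["options"], key=objective)

def proxy_optimizing_mesaoptimizer(observation, proxy_for_base_obj):
    # Spends its whole budget optimizing a valid proxy for the base objective.
    return optimize(proxy_for_base_obj, observation)

def deceptive_mesaoptimizer(observation, proxy_for_base_obj, secret_objective):
    # Same core machinery, plus extra space for the secret objective and extra
    # time for the "am I still in training?" check.
    if observation.get("in_training", True):
        return optimize(proxy_for_base_obj, observation)  # play along with the proxy
    return optimize(secret_objective, observation)        # pursue the secret goal

# e.g. deceptive_mesaoptimizer({"options": [0, 1], "in_training": False}, abs, lambda a: -a)
```

The toy version isn’t meant to be realistic; it’s only meant to show where the extra space and time costs of the deceptive variant sit, which is what the speed prior argument is about.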
Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it’s dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I’m mainly arguing that your post isn’t correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient descent would actually look like, which matters a lot for which theoretical models apply in reality.
I very much do not believe that a mesaoptimizer found by gradient descent would look anything like the above Python programs. I’m just using this as a simplification to try and get at trends that I think it represents.
Re: (1), my argument is exactly about whether gradient descent would summon an agent with a weird pseudogoal Y that was not itself a proxy for reward on its training distribution. If pursuing Y directly (where Y is different from the base optimizer’s goal, e.g. Z) already got good reward on the training distribution, then Y would just be another such proxy.
I’m realizing some of the confusion might be because I named the goal-finding function “get_base_obj” instead of “get_proxy_for_base_obj”. That seems like it would definitely mislead people; I’ll fix it.
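Schematically, the rename is just the following (with “world_model” as a made-up placeholder argument, not the post’s actual code):

```python
# old name, which wrongly suggests the function recovers the base objective itself:
#     mesa_objective = get_base_obj(world_model)
# new name, which makes explicit that it only ever recovers a proxy for it:
#     mesa_objective = get_proxy_for_base_obj(world_model)
```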