Two quick things to say:
(1) I think the traditional story is more that your agent pursues mostly-X while it’s dumb, but then gradient descent summons something intelligent with some weird pseudo-goal Y, because this can be selected for when you reward the agent for looking like it pursues X.
(2) I’m mainly arguing that your post isn’t correctly examining the effect of a speed prior. Though I also think that one or both of us are confused about what a mesaoptimizer found by gradient descent would actually look like, which matters a lot for which theoretical models apply in reality.
I very much do not believe that a mesaoptimizer found by gradient descent would look anything like the above Python programs. I’m just using them as a simplification to try to get at the trends I think they represent.
Re: (1) my argument is exactly about whether gradient descent would summon an agent with a weird pseudo-goal Y that was not itself a proxy for reward on its training distribution. If pursuing Y directly (where Y is different from the base optimizer’s goal, e.g. Z) didn’t produce high reward on the training distribution, I’d expect gradient descent to select against it.
I’m realizing some of the confusion might be because I named the goal-finding function “get_base_obj” instead of “get_proxy_for_base_obj”. That seems like it would definitely mislead people; I’ll fix that.
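To illustrate the renaming: the original post's Python programs aren't reproduced here, so everything below except the name `get_proxy_for_base_obj` is a hypothetical sketch of the kind of toy program under discussion, not the actual code.

```python
def get_proxy_for_base_obj():
    # Renamed from get_base_obj: the mesa-objective Y is only a *proxy*
    # for the base objective Z, learned because it correlated with
    # reward on the training distribution. (Structure is illustrative.)
    return lambda state: state.get("proxy_score", 0)


def mesa_optimizer(candidate_states):
    # The inner optimizer searches for whatever maximizes its proxy
    # objective Y -- not the base objective Z itself.
    objective = get_proxy_for_base_obj()
    return max(candidate_states, key=objective)


# The selected state maximizes the proxy, whether or not that
# coincides with the base objective off-distribution.
best = mesa_optimizer([{"proxy_score": 1}, {"proxy_score": 3}])
```

The point of the name change is just that the function hands back Y, a proxy for Z, rather than Z itself.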