The RL process is actually optimizing $\mathbb{E}[\sum_t r_t]$; the log just comes from the REINFORCE trick. Regardless, I’m not sure I understand what you mean by optimizing fully to convergence at each timestep: convergence is a limiting property, so I don’t know what it could mean to do it for a single timestep. Perhaps you mean just taking the optimal policy $\pi^*$ such that
$\pi^* = \operatorname{argmax}_\pi \mathbb{E}[\sum_t r_t \mid \pi]$?
In that case, that is in fact the definition of outer alignment I’ve given in the past, so I agree that whether $\pi^*$ is aligned or not is an outer alignment question.
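For readers wondering where the log comes from, here is a rough sketch of the standard REINFORCE (score-function) identity. The notation is assumed for illustration and is not from the thread: $\pi_\theta$ is a policy with parameters $\theta$, $\tau$ is a trajectory with density $p_\theta(\tau)$, and $R(\tau) = \sum_t r_t$. The objective is the expected return; the log only shows up in the expression for its gradient.

```latex
% Assumed notation (not from the thread): \pi_\theta, p_\theta(\tau), R(\tau) = \sum_t r_t
\begin{align*}
\nabla_\theta \, \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \right]
  &= \nabla_\theta \int p_\theta(\tau) \, R(\tau) \, d\tau \\
  &= \int p_\theta(\tau) \, \nabla_\theta \log p_\theta(\tau) \, R(\tau) \, d\tau
     && \text{since } \nabla_\theta p_\theta = p_\theta \nabla_\theta \log p_\theta \\
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \, \nabla_\theta \log p_\theta(\tau) \right] \\
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
     && \text{dynamics do not depend on } \theta
\end{align*}
```

So the log-probability term is a device for estimating the gradient from samples, while the quantity being maximized remains $\mathbb{E}[\sum_t r_t]$.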
Sure, $\pi^*$ works for what I’m saying, assuming that sum-over-time only includes the timesteps taken thus far. In that case, I’m saying that either:
- the mesa optimizer doesn’t appear in $\pi^*$, in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using $\pi^*$), or
- the mesa optimizer does appear in $\pi^*$, in which case the problem was really an outer alignment issue all along.