evhub comments on Alignment By Default

evhub 14 Aug 2020 23:50 UTC
LW: 2 AF: 1
AF
The RL process is actually optimizing $E[\sum_{t} r_{t}]$; the log just comes from the REINFORCE trick. Regardless, I’m not sure I understand what you mean by optimizing fully to convergence at each timestep: convergence is a limiting property, so I don’t know what it could mean to do it for a single timestep. Perhaps you mean just taking the optimal policy $π^{*}$ such that $π^{*} = \operatorname{argmax}_{π} E[\sum_{t} r_{t} \mid π]$? In that case, that is in fact the definition of outer alignment I’ve given in the past, so I agree that whether $π^{*}$ is aligned or not is an outer alignment question.
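For reference, here is a sketch of where the log comes from, using the standard log-derivative identity. The notation is an assumption for illustration and is not from the thread: $π_θ$ is a policy with parameters $θ$, $τ$ is a trajectory with probability $p_θ(τ)$, and $R(τ) = \sum_{t} r_{t}$ is its total reward.

```latex
\begin{aligned}
\nabla_\theta \, E_{\tau \sim \pi_\theta}\Big[\sum_{t} r_{t}\Big]
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= E_{\tau \sim \pi_\theta}\Big[ R(\tau) \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \Big]
\end{aligned}
```

The second line uses $\nabla_θ p_θ = p_θ \nabla_θ \log p_θ$. The third uses the fact that the environment dynamics do not depend on $θ$, so only the policy terms of $\log p_θ(τ)$ survive differentiation. The quantity being optimized is still the expected sum of rewards; the log appears only in the gradient estimator.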
What links here?
- “Inner Alignment Failures” Which Are Actually Outer Alignment Failures by johnswentworth (31 Oct 2020 20:18 UTC; 66 points)
- johnswentworth 15 Aug 2020 2:25 UTC
  LW: 6 AF: 3
  AF Parent
  Sure, $π^{*}$ works for what I’m saying, assuming that sum-over-time only includes the timesteps taken thus far. In that case, I’m saying that either:
  - the mesa optimizer doesn’t appear in $π^{*}$, in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using $π^{*}$), or
  - the mesa optimizer does appear in $π^{*}$, in which case the problem was really an outer alignment issue all along.