evhub comments on Formal Solution to the Inner Alignment Problem

evhub 27 Feb 2021 20:12 UTC
LW: 2 AF: 2
AF

I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won’t, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.

a) Does that seem right? and b) Aren’t there some policy gradient methods that don’t face this problem?

This seems wrong to me: even though the Q learner is trained using its own point estimate of the next state, it isn’t, at inference time, given access to that point estimate. The Q learner has to choose its Q values before it knows anything about what the Q-value estimates of future states will be, which means it certainly should have to consider different models of what the next transition will be like.
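To make the training/inference distinction concrete, here is a minimal tabular sketch. It is an illustration only, not anything from the post: the three-state setup, the fixed next-state values, and the helper names `td_target`, `q_update`, and `act` are all hypothetical. The update bootstraps from a single cached estimate of the next state's value, while action selection reads only the current state's Q-values.

```python
# Hypothetical toy setup: from state 0, action 1 leads to state 1 or state 2
# (alternating here as a deterministic stand-in for a 50/50 transition).
# States 1 and 2 have one action each, with fixed cached value estimates.
q = {0: [0.0, 0.0], 1: [1.0], 2: [3.0]}


def td_target(q, reward, next_state, gamma=0.9):
    """Bootstrapped target: treats max_a' Q(s', a') as a single point estimate."""
    return reward + gamma * max(q[next_state])


def q_update(q, state, action, reward, next_state, alpha, gamma=0.9):
    """One Q-learning step toward the bootstrapped target."""
    target = td_target(q, reward, next_state, gamma)
    q[state][action] += alpha * (target - q[state][action])


def act(q, state):
    """Inference: uses only Q(state, .); no next-state estimate is consulted."""
    values = q[state]
    return max(range(len(values)), key=values.__getitem__)


# Train Q(0, 1) with a 1/n step size, so it becomes a running mean of targets.
for n in range(1, 1001):
    next_state = 1 if n % 2 else 2
    q_update(q, 0, 1, reward=0.0, next_state=next_state, alpha=1.0 / n)

print(q[0][1])    # close to 0.9 * (1.0 + 3.0) / 2
print(act(q, 0))  # greedy action in state 0
```

In this sketch the learned value of the (state 0, action 1) pair averages over which transition occurs, but each individual target still uses one number per next state, never a distribution over what that next state's value might be.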
- michaelcohen 28 Feb 2021 11:14 UTC
  LW: 1 AF: 1
  AF
  it certainly should have to consider different models of what the next transition will be like.
  Yeah I was agreeing with that.
  even though the Q learner is trained using its own point estimate of the next state, it isn’t, at inference time, given access to that point estimate.
  Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point estimate of the Q-value of the next state (since it doesn’t have access to it). What it isn’t trying to reproduce, because it isn’t trained that way, is multiple models of what the Q-value might be at a given possible next state.