I certainly don’t think SGD is a powerful enough optimization process to do science directly, but it definitely seems powerful enough to find an agent which does do science.
But taking us back out of RL, in a wide neural network with selective attention that enables many qualitatively different forward passes, gradient descent seems to be training the way different models get proposed (i.e. the way attention is allocated), since this happens in a single forward pass. What we’re left with is a modeling routine that heuristically considers (and later compares) very different models, and this should include any model that a human would consider.
I think that is the main thread of our argument, but now I’m curious whether I was totally off the mark about Q-learning and policy gradient.
but overall I think these sorts of differences are pretty minor and shouldn’t affect whether these approaches can reach general intelligence or not.
I had thought that maybe since a Q-learner is trained as if the cached point estimate of the Q-value of the next state is the Truth, it won’t, in a single forward pass, consider different models about what the actual Q-value of the next state is. At most, it will consider different models about what the very next transition will be.
a) Does that seem right? and b) Aren’t there some policy gradient methods that don’t face this problem?
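To make the "cached point estimate" concrete, here is a minimal sketch of a tabular Q-learning step. It is only an illustration under assumptions: the toy states `s0` and `s1`, the helper names `q_update` and `greedy_action`, and the values of `alpha` and `gamma` are all hypothetical and not taken from the discussion. The point it shows is that the training target uses a single number for the value of the next state, while acting only reads the table at the current state.

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step (hypothetical helper, for illustration only)."""
    # Bootstrapped target: the cached estimate max_a' Q[(s_next, a')] enters
    # as one scalar, as if it were the true value of the next state.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Move the current estimate a fraction alpha of the way toward the target.
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return target


def greedy_action(Q, s, actions):
    # Acting only looks up Q at the current state; no next-state estimate
    # is passed in at this point.
    return max(actions, key=lambda a: Q[(s, a)])


actions = ["left", "right"]
Q = defaultdict(float)            # all estimates start at 0.0
Q[("s1", "left")] = 2.0           # assumed cached estimate for the next state

target = q_update(Q, "s0", "right", r=1.0, s_next="s1", actions=actions)
best = greedy_action(Q, "s0", actions)
```

Nothing in this update represents uncertainty about the value of `s1`: whatever alternatives the learner entertains have to be folded into that one scalar, which is the worry raised above.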
This seems wrong to me—even though the Q learner is trained using its own point estimate of the next state, it isn’t, at inference time, given access to that point estimate. The Q learner has to choose its Q values before it knows anything about what the Q value estimates will be of future states, which means it certainly should have to consider different models of what the next transition will be like.
it certainly should have to consider different models of what the next transition will be like.
Yeah I was agreeing with that.
even though the Q learner is trained using its own point estimate of the next state, it isn’t, at inference time, given access to that point estimate.
Right, but one thing the Q-network, in its forward pass, is trying to reproduce is the point estimate of the Q-value of the next state (since it doesn’t have access to it). What it isn’t trying to reproduce, because it isn’t trained that way, is multiple models of what the Q-value might be at a given possible next state.
I interpreted this bit as talking about RL