A note of caution: when I hand-coded the weights of a neural network (in my case, to solve a gridworld RL problem), I was able to encode the optimal policy, but the algorithm later learned by gradient descent was very different. Partly this was because I only required myself to produce the right action, so I often had the (equivalent of) Q-values for different actions be very close to each other, whereas the trained network ended up with Q-values that were much further apart: the loss function incentivized that separation even though it made no difference to the optimal policy.
So to the extent you're trying to learn what a neural net trained by gradient descent would do, I'd recommend spending some time looking at the trained network to see whether it is using a similar sort of algorithm to the one you're implementing.
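To make the effect concrete, here is a toy sketch with made-up numbers (this is not the original gridworld setup, and the softmax objective below is just one example of a loss that rewards separation, not necessarily the one used): both Q-value vectors pick the same greedy action, but an objective that pushes up the probability of the chosen action scores the well-separated values much better.

```python
import numpy as np

# Toy sketch (made-up numbers, not the original gridworld experiment):
# two Q-value vectors for the same state. Both pick the same greedy action,
# but in the hand-coded one the best action is only barely ahead.
q_handcoded = np.array([1.00, 0.99, 0.98, 0.99])  # nearly tied
q_trained   = np.array([1.00, 0.20, 0.10, 0.15])  # well separated

assert q_handcoded.argmax() == q_trained.argmax()  # identical greedy policy

def logprob_of_greedy(q):
    """Log-probability of the greedy action under a softmax over Q-values."""
    p = np.exp(q - q.max())
    p /= p.sum()
    return float(np.log(p[q.argmax()]))

# A softmax / cross-entropy-style objective prefers the separated values,
# even though the argmax (and hence the policy) is unchanged:
print(logprob_of_greedy(q_handcoded))  # ~ -1.38 (nearly uniform over actions)
print(logprob_of_greedy(q_trained))    # ~ -0.83 (much more confident)
```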
Very cool!
Agree with this.