I am not sure if the calculation in Appendix B is quite accurate; I would like to ask you for a better explanation if I am not quite right.
In the first line (calculation of ‘m’), we can clearly see that there are 4 operations. Now, we could assume that (1-beta1) could be pre-calculated, and hence there are only 3 operations.
If we accept that argument, then the calculations of ‘m_hat’ and ‘v_hat’ should be considered to have only 1 operation each. I do see the transpose there, which is weird to me too; although PyTorch’s documentation gives the same set of mathematical equations, the default parameters use scalar values for beta1 and beta2.
I am really trying to make sense of the calculation here, but I really can’t. Could you please provide more information on this?
t is not a transpose! It is the timestep t: we are raising β to the t-th power for the bias correction, so the correction factor (1 − β^t) approaches 1 as t grows.
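To make this concrete, here is a minimal sketch of one Adam step using the standard equations (the hyperparameter values below are the usual defaults, not something stated in the thread; the function name and NumPy usage are my own choices for illustration). Note how `beta1 ** t` is a scalar power of a scalar, not a matrix transpose:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update step (standard equations; defaults are the common ones)."""
    # First moment: with (1 - beta1) precomputed, this is 3 ops per element.
    m = beta1 * m + (1 - beta1) * g
    # Second moment: same structure, applied to g squared.
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias correction: beta1 ** t raises the scalar beta1 to the timestep t
    # (NOT a transpose); the denominator is a scalar, so each line is
    # essentially one elementwise division.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

At t = 1 the correction denominators are exactly (1 − β1) and (1 − β2), so `m_hat` and `v_hat` recover the raw gradient statistics; as t grows the denominators approach 1 and the correction fades out.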