This may be a naive question/observation, but it seems to me that treating the trace as a vector conflates the behavior before and after convergence. My intuition is that SGLD is basically an MCMC technique, so if I'm misunderstanding, that's probably why.
After convergence, the samples should be viewed as draws from the stationary distribution, and ideally they have low autocorrelation, so treating them as a vector doesn't seem to make sense: there should be many equivalent traces. I'd be interested to see density estimates per sample point, not just the per-timestep ones you have.
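To make the density-estimate idea concrete, here's a rough sketch of pooling post-burn-in SGLD losses and estimating their density. The toy trace, the burn-in cutoff, and the histogram estimator are all my own assumptions for illustration, not anything from the post:

```python
import numpy as np

# Sketch: after discarding burn-in, treat the remaining SGLD losses as
# (approximately) i.i.d. draws from the stationary distribution and look
# at their density rather than their ordering.
rng = np.random.default_rng(0)
trace = np.concatenate([
    np.linspace(3.0, 1.0, 100),             # pre-convergence decay (toy)
    1.0 + 0.05 * rng.standard_normal(900),  # stationary samples (toy)
])
burn_in = 100  # assumed cutoff; in practice this needs a diagnostic
stationary = trace[burn_in:]

# a simple histogram density estimate of the stationary loss distribution
density, edges = np.histogram(stationary, bins=30, density=True)
mode_loss = edges[np.argmax(density)]  # where the stationary mass concentrates
```

The point is just that once the chain has mixed, the distribution of losses carries the information, and any reordering of the post-burn-in samples would be an equally valid trace.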
Before convergence, it does make sense to look at the vector, but it's still noisy, so I'd also be interested in summaries of it: for example, time to convergence and the slope of the loss (which is a function of time and of the mean loss of the stationary distribution).
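As a sketch of the kind of summaries I mean, here's one way to extract a convergence time and a pre-convergence slope from a loss trace. The tolerance-based convergence heuristic and the toy trace are my own assumptions, not anything from the post:

```python
import numpy as np

def summarize_trace(losses, tol=0.1):
    """Heuristic summaries of an SGLD loss trace.

    Convergence time is taken as the first step after which the loss stays
    within a relative tolerance of the tail mean (a stand-in for the
    stationary mean loss); the slope is a least-squares fit over the
    pre-convergence segment. Both definitions are assumptions.
    """
    losses = np.asarray(losses, dtype=float)
    tail_mean = losses[len(losses) // 2:].mean()  # proxy for stationary mean
    within = np.abs(losses - tail_mean) <= tol * abs(tail_mean)
    t_conv = len(losses)
    for t in range(len(losses)):
        if within[t:].all():  # stays within tolerance from t onward
            t_conv = t
            break
    # slope of the pre-convergence segment (0 if it converged immediately)
    slope = np.polyfit(np.arange(t_conv), losses[:t_conv], 1)[0] if t_conv >= 2 else 0.0
    return t_conv, slope, tail_mean

# toy trace: exponential decay toward a noisy stationary value
rng = np.random.default_rng(0)
trace = 2.0 * np.exp(-np.arange(200) / 30) + 1.0 + 0.01 * rng.standard_normal(200)
t_conv, slope, stat_mean = summarize_trace(trace)
```

These scalar summaries would be much less sensitive to sampling noise than the raw vector, at the cost of throwing away the shape of the descent.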
I also find myself wondering about other, more tangential things:
- Could you use a per-sample gradient trace (rather than the loss trace) of the SGLD to learn something?
- Does it make sense to, e.g., run multiple chains?
- Does it make sense to change the temperature throughout the run (like simulated annealing) rather than just running at each temperature separately?
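On the multiple-chains question, the standard Gelman–Rubin R-hat diagnostic is the sort of thing I have in mind: running a few independent SGLD chains and checking whether they agree on the stationary distribution. A minimal sketch on synthetic loss traces (my own illustration, not code from the post):

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R-hat over post-burn-in loss traces.

    `chains` is an (m, n) array: m independent chains, n samples each.
    R-hat near 1 suggests the chains agree on the stationary distribution.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)        # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()  # within-chain variance
    var_plus = (n - 1) / n * W + B / n     # pooled variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(1)
# three chains sampling the same (toy) stationary loss distribution
good = 1.0 + 0.05 * rng.standard_normal((3, 500))
rhat = gelman_rubin(good)  # should be close to 1 for well-mixed chains
```

Multiple chains would also give a direct handle on the "many equivalent traces" point above: if the traces really are exchangeable after convergence, the chains should be indistinguishable.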
Thanks!
I did notice by the third read-through that we were comparing traces at the same parameter values, so I appreciate the clarification. I think the thing that would have made this clear to me is an explicit mention that it only makes sense to compare traces within the same run.