Charlie Steiner comments on TurnTrout’s shortform feed

Charlie Steiner 13 Nov 2024 22:31 UTC
LW: 13 AF: 8
I agree with many of these criticisms about hype, but I think this rhetorical question should be non-rhetorically answered.
No, that’s not how RL works. RL—in settings like REINFORCE for simplicity—provides a per-datapoint learning rate modifier. How does a per-datapoint learning rate multiplier inherently “incentivize” the trained artifact to try to maximize the per-datapoint learning rate multiplier? By rephrasing the question, we arrive at different conclusions, indicating that leading terminology like “reward” and “incentivized” led us astray.
How does a per-datapoint learning rate modifier inherently incentivize the trained artifact to try to maximize the per-datapoint learning rate multiplier?
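To make the quoted framing concrete, here is a minimal sketch of REINFORCE on a two-armed bandit, in which the reward appears only as a scalar that multiplies each sampled datapoint's gradient step. The function names, the reward values, and the hyperparameters are all illustrative assumptions, not anything taken from the original discussion.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, n_steps=2000, lr=0.1, seed=0):
    """Train a softmax policy over bandit arms with plain REINFORCE.

    `rewards` is a hypothetical list of fixed per-arm rewards.
    """
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(n_steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        for i in range(len(logits)):
            # Gradient of log pi(action) w.r.t. logit i for a softmax policy.
            grad_logp = (1.0 if i == action else 0.0) - probs[i]
            # The reward only rescales this datapoint's step size.
            logits[i] += lr * reward * grad_logp
    return softmax(logits)
```

Calling `reinforce_bandit([0.0, 1.0])` returns the final action probabilities. In this toy setup, samples with reward 0 produce no update at all, and samples with reward 1 produce a full-size step toward the sampled action.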
For readers familiar with Markov chain Monte Carlo, you can probably fill in the blanks now that I’ve primed you.
For those who want to read on: if you have an energy landscape and you want to find a global minimum, a great way to do it is to start at some initial guess and then wander around, going uphill sometimes and downhill sometimes, but with some kind of bias towards going downhill. See the AlphaPhoenix video for a nice example. This works even better than going straight downhill because you don’t want to get stuck in local minima.
The typical algorithm for this is to sample a step, always take it if it goes downhill, and take it only with some probability if it leads uphill (with smaller probability the more uphill it is). But another, very similar algorithm is to just take smaller steps when going uphill than when going downhill.
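The two walks described above can be sketched roughly as follows. This is a toy illustration under stated assumptions: the double-well `energy` function, the temperature, the proposal width, and the function names are all hypothetical, and the exponential shrink factor in `scaled_step_walk` is just one possible reading of "smaller steps when going uphill", not a claim that it samples the same distribution as the accept/reject version.

```python
import math
import random


def energy(x):
    # Hypothetical 1-D landscape: a local minimum near x = +1
    # and a lower (global) minimum near x = -1.
    return (x * x - 1.0) ** 2 + 0.3 * x


def metropolis(energy, x0, n_steps=20000, step=0.5, temperature=0.3, seed=0):
    """Accept/reject walk; returns the lowest-energy point visited."""
    rng = random.Random(seed)
    x = x0
    best = x
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        delta = energy(proposal) - energy(x)
        # Always accept downhill moves; accept uphill moves with a
        # probability that shrinks the more uphill they are.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x = proposal
        if energy(x) < energy(best):
            best = x
    return best


def scaled_step_walk(energy, x0, n_steps=20000, step=0.5, temperature=0.3, seed=0):
    """Always move, but shrink the step when the sampled direction is uphill."""
    rng = random.Random(seed)
    x = x0
    for _ in range(n_steps):
        direction = rng.uniform(-step, step)
        delta = energy(x + direction) - energy(x)
        # Assumed rule: full step downhill, exponentially smaller step uphill.
        scale = 1.0 if delta <= 0 else math.exp(-delta / temperature)
        x += scale * direction
    return x
```

Starting `metropolis(energy, x0=1.0)` in the shallower well lets the walk cross the barrier because uphill moves are sometimes accepted, which is the point about not getting stuck in local minima. The second function never consults an acceptance probability; the energy function shows up only through the step sizes.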
If you were never told about the energy landscape, but you were told about a pattern of larger and smaller steps you’re supposed to take based on stochastically sampled directions, then an interesting question is: when can you infer an energy function that’s implicitly getting optimized for?
Obviously, if the sampling is uniform and the step size when going uphill looks like it could be generated by taking the reciprocal of the derivative of an energy function, you should start getting suspicious. But what if the sampling is nonuniform? What if there’s no cap on step size? What if the step size rule has cycles or other bad behavior? Can you still model what’s going on as a Markov chain Monte Carlo procedure plus some extra stuff?
I don’t know, these seem like interesting questions in learning theory. If you search for questions like “under what conditions does the REINFORCE algorithm find a global optimum,” you find papers like this one that don’t talk about MCMC, so maybe I’ve lost the plot.
But anyhow, this seems like the shape of the answer. If you pick random steps to take but take bigger steps according to some rule, then that rule might be telling you about an underlying energy landscape that you’re doing a Markov chain Monte Carlo walk around.