The inner RL algorithm adjusts its learning rate to improve performance.
I have come across a lot of learning rate adjustment schemes in my time, and none of them have been ‘obviously good’, although I think some have been conceptually simple and relatively easy to find. If this is what’s actually going on and the learned scheme can be extracted, it would be interesting to see what it’s doing here (and whether that scheme works well on its own).
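For concreteness, here's a minimal sketch (my own illustration, not anything recovered from the setup under discussion) of one such conceptually simple scheme: the classic "bold driver" rule, which grows the step size while the loss keeps improving and cuts it sharply when it worsens. The function names and parameters are hypothetical.

```python
# Hypothetical sketch of a simple learning-rate adjustment heuristic ("bold driver"):
# grow the learning rate a little after an improving step, shrink it hard after a
# worsening one. Not the scheme from the discussion above, just an illustration.
import numpy as np

def bold_driver_sgd(grad_fn, loss_fn, w, lr=0.1, grow=1.05, shrink=0.5, steps=200):
    """Plain gradient descent whose learning rate is adjusted after every step.

    grad_fn(w) -> gradient, loss_fn(w) -> scalar loss; both names are illustrative.
    """
    prev_loss = loss_fn(w)
    for _ in range(steps):
        w_new = w - lr * grad_fn(w)
        loss = loss_fn(w_new)
        if loss <= prev_loss:
            # Improvement: accept the step and grow the learning rate slightly.
            w, prev_loss, lr = w_new, loss, lr * grow
        else:
            # Got worse: reject the step and cut the learning rate sharply.
            lr *= shrink
    return w, lr

if __name__ == "__main__":
    # Toy quadratic: minimize ||w - target||^2.
    target = np.array([3.0, 3.0])
    w, final_lr = bold_driver_sgd(
        grad_fn=lambda w: 2 * (w - target),
        loss_fn=lambda w: float(np.sum((w - target) ** 2)),
        w=np.zeros(2),
    )
    print(w, final_lr)
```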
This is more concerning than a thermostat-like bag of heuristics, because an RL algorithm is a pretty agentic thing, one that can adapt to new situations and produce novel, clever behavior.
Most RL training algorithms that we have look to me like putting a thermostat on top of a model; I think you’re underestimating deep thermostats.