thinking at the level of constraints is useful. very sparse rewards offer less constraints on final solution. imitation would offer a lot of constraints (within distribution and assuming very low loss).
a way to see RL/supervised distinction dissolve is to convert back and forth. With a reward as negative token prediction loss, and actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). conversely, you could first train RL policy and then imitate that (in which case why would imitator be any safer?).
also, the level of capabilities and the output domain might affect the differences between sparse/dense reward. even if we completely constrain a CPU simulator (to the point that only one solution remains), we still end up with a thing that can run arbitrary programs. At the point where your CPU simulator can be used without performance penalty to do the complex task that your RL agent was doing, it is hard to say which is safer by appealing to the level of constraints in training.
i think something similar could be said of a future pretrained LLM that can solve tough RL problems simply by being prompted to “simulate the appropriate RL agent”, but i am curious what others think here.
thinking at the level of constraints is useful. very sparse rewards offer less constraints on final solution. imitation would offer a lot of constraints (within distribution and assuming very low loss).
a way to see RL/supervised distinction dissolve is to convert back and forth. With a reward as negative token prediction loss, and actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). conversely, you could first train RL policy and then imitate that (in which case why would imitator be any safer?).
also, the level of capabilities and the output domain might affect the differences between sparse/dense reward. even if we completely constrain a CPU simulator (to the point that only one solution remains), we still end up with a thing that can run arbitrary programs. At the point where your CPU simulator can be used without performance penalty to do the complex task that your RL agent was doing, it is hard to say which is safer by appealing to the level of constraints in training.
i think something similar could be said of a future pretrained LLM that can solve tough RL problems simply by being prompted to “simulate the appropriate RL agent”, but i am curious what others think here.