for a sufficiently competent policy, the fact that BoN doesn’t update the policy doesn’t mean it leaks any fewer bits of info to the policy than normal RL
Something between training the whole model with RL and BoN is training just the last few layers of the model (for current architectures) with RL and then doing BoN on top as needed to increase performance. This means most of the model won’t know the information (except insofar as the info shows up in outputs) and allows you to get some of the runtime cost reductions of using RL rather than BoN.
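To make the setup concrete, here is a minimal sketch (not the commenter's actual implementation) of the "RL on the last few layers, BoN on top" idea: freeze everything except the final transformer blocks so RL only updates those, then do best-of-N sampling at inference. The GPT-2 model name, the layer-access path, and `reward_model` are illustrative assumptions, not specifics from the comment.

```python
# Hedged sketch: freeze all but the last few blocks for RL, then best-of-N on top.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")       # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Freeze all parameters, then unfreeze only the last two blocks (count is arbitrary).
for p in model.parameters():
    p.requires_grad = False
for block in model.transformer.h[-2:]:
    for p in block.parameters():
        p.requires_grad = True

# ... RL training would go here, updating only the unfrozen parameters ...

def best_of_n(prompt: str, n: int = 4) -> str:
    """Sample n completions and return the one a (hypothetical) reward model scores highest."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=64,
        num_return_sequences=n,
    )
    completions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return max(completions, key=reward_model)  # reward_model: assumed external scorer
```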
I’m claiming that even if you go all the way to BoN, it still doesn’t necessarily leak less info to the model
Oh huh, parse error on me.