for a sufficiently competent policy, the fact that BoN doesn’t update the policy doesn’t mean it leaks any fewer bits of info to the policy than normal RL
Something between training the whole model with RL and BoN is training just the last few layers of the model (for current architectures) with RL and then doing BoN on top as needed to increase performance. This means most of the model won’t know the information (except insofar as the info shows up in outputs) and allows you to get some of the runtime cost reductions of using RL rather than BoN.
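To make the setup concrete, here is a minimal sketch (not the commenter's actual implementation) of the "RL on the last few layers, BoN on top" idea: freeze everything except the final transformer blocks so RL only updates those, then do best-of-N sampling at inference. The GPT-2 model name, the layer-access path, and `reward_model` are illustrative assumptions, not specifics from the comment.

```python
# Hedged sketch: freeze all but the last few blocks for RL, then best-of-N on top.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")       # placeholder model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Freeze all parameters, then unfreeze only the last two blocks (count is arbitrary).
for p in model.parameters():
    p.requires_grad = False
for block in model.transformer.h[-2:]:
    for p in block.parameters():
        p.requires_grad = True

# ... RL training would go here, updating only the unfrozen parameters ...

def best_of_n(prompt: str, n: int = 4) -> str:
    """Sample n completions and return the one a (hypothetical) reward model scores highest."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        max_new_tokens=64,
        num_return_sequences=n,
    )
    completions = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
    return max(completions, key=reward_model)  # reward_model: assumed external scorer
```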
I’m claiming that even if you go all the way to BoN, it still doesn’t necessarily leak less info to the model
Oh huh, parse error on me.