Interested to see evaluations on tasks not selected to be reward-hackable, and attempts to make performance competitive with standard RL.
Us too! At the time we started this project, we tried some more realistic settings, but it was really hard to get multi-step RL working on LLMs. (Not MONA, just regular RL.) I expect it’s more doable now.
For a variety of reasons the core team behind this paper has moved on to other things, so we won’t get to it in the near future, but it would be great to see others working on this!