This is pretty good. It covers a lot of ground, being something of a grab bag. I particularly enjoyed the scalable oversight sections, which succinctly explained debate, recursive reward modelling, etc. There were also some gems I hadn’t encountered before, like the concept of training out agentic behavior by punishing side effects.
If anyone wants the HTML version of the paper, it is here.