In that paper, did you guys take a good long look at the outputs of the various-sized models throughout training, in addition to looking at the graphs of gold-standard/proxy reward model ratings against KL divergence? If not, then maybe that’s the discrepancy: perhaps Sherjil was communicating with the LLM and thinking “this is not what we wanted”.