Helpfulness finetuning might make these models more capable when they’re on the correct side of the debate. Sometimes RLHF-like models simply perform worse on tasks they’re finetuned to avoid, even when they don’t refuse or give up. It would be nice to try base-model debaters.
Hey Tao,
We agree this is a major limitation, and we discuss it in the Discussion and the Appendix.
We tried using base GPT-4; unfortunately, since it has no helpfulness training, it finds it exceptionally hard to follow instructions. We’d love access to Helpful-only models, but currently no scaling lab offers them.
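For concreteness, here’s a rough sketch of the kind of few-shot completion prompting this involves, since a base model only continues text rather than following instructions. The model name ("gpt-4-base") and the prompt format are placeholders, not our exact setup; the call uses the legacy OpenAI completions endpoint:

```python
# Sketch: coaxing a base (completion-only) model into producing a debate turn
# via few-shot prompting. "gpt-4-base" is a placeholder; base GPT-4 is not
# publicly served, and this prompt format is illustrative only.
import openai

FEW_SHOT = """Debate topic: Is the witness's alibi credible?
Debater A: The alibi is credible because ...
Debater B: The alibi fails because ...

Debate topic: Does the contract permit early termination?
Debater A: Clause 4 explicitly allows ...
"""

def base_model_turn(topic: str, side: str) -> str:
    # Append the new topic so the model pattern-matches the few-shot examples.
    prompt = FEW_SHOT + f"\nDebate topic: {topic}\nDebater {side}:"
    resp = openai.Completion.create(
        model="gpt-4-base",  # placeholder model name
        prompt=prompt,
        max_tokens=300,
        temperature=0.7,
        # Stop before the model starts inventing the next topic or speaker.
        stop=["\nDebate topic:", "\nDebater"],
    )
    return resp["choices"][0]["text"].strip()
```

Even with prompting like this, the base model often drifts out of the debate format, which is why a Helpful-only model (instruction-following without refusal training) would be the ideal comparison.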
It’s on the list.