Did you notice any qualitative trends in responses as you optimized harder for the models of the gold RM? Like, anything aside from just “sounding kind of like instruct-GPT”?
There’s an example in the appendix but we didn’t do a lot of qualitative analysis.