Yep, that sounds right! The measure we’re using gets noisier with better performance, so even faithfulness-vs-performance breaks down at some point. I think this is mostly an argument to use different metrics and/or tasks if you’re focused on scaling trends.