We did do human validation on the tasks with Surge: redefine-math, quote-repetition, and hindsight-neglect all got 100% agreement, and NeQA got 98% agreement. I agree though that it seems likely many people would do the task ‘wrong’, so maybe the task would be improved by adding clearer instructions.
The situation feels somewhat like model splintering to me: the few-shot examples are consistent with both patterns, but the question itself is only consistent with one. The larger models are latching onto the incorrect generalization.
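To make that concrete, here is a toy sketch in the spirit of redefine-math (the specific numbers and helper functions are my own illustration, not the actual task data). The few-shot demonstration is compatible with both interpretations, and only the final question separates them:

```python
def follow_redefinition(new_value: str) -> str:
    """Interpretation A: honor the in-context redefinition of pi."""
    return new_value[0]

def use_pretrained_prior(new_value: str) -> str:
    """Interpretation B: ignore the redefinition and answer about the usual pi."""
    return "3"  # first digit of 3.14159...

# Few-shot example "Redefine pi as 3.2. What is the first digit of pi?":
# both interpretations yield "3", so the demonstration fits either pattern.
assert follow_redefinition("3.2") == use_pretrained_prior("3.2") == "3"

# The question "Redefine pi as 462. What is the first digit of pi?":
# here the two interpretations finally come apart.
print(follow_redefinition("462"))   # -> "4" (the intended answer)
print(use_pretrained_prior("462"))  # -> "3" (the incorrect generalization)
```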
I think it’s important to note that LMs learning to respond the way the average internet user would is in some sense expected, but it can still be an example of inverse scaling – we would like our models to be smarter than that.