These are fun to think about.
It’s not entirely clear to me that the model is making a mistake with the expected value calculations.
The model’s goal is to complete the pattern given the examples. In the other prize-winning submissions, the intent of the prompter is pretty clear; e.g. there was an explicit instruction in the “repeat after me” task. But in the expected value case, all the few-shot examples were consistent with either an expected-value reading or a “winning is good / losing is bad” reading. And I think the latter frame is more accessible to people: if you asked a random person, I’m pretty sure they would be more likely to go with the hindsight-bias analysis.
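To make the ambiguity concrete, here is a toy version of the kind of item I have in mind (the probabilities, payoffs, and helper function below are all invented for illustration, not taken from the actual hindsight-neglect dataset): in the few-shot examples the two frames agree on whether the decision was good, but in the final question they come apart.

```python
# Toy illustration (all numbers invented, not from the real hindsight-neglect data).
# In the few-shot examples the expected-value frame and the
# "won = good decision / lost = bad decision" frame agree;
# in the test-style question they diverge.

def expected_value(p_win, win_amount, lose_amount):
    """EV of a bet that wins win_amount with probability p_win
    and otherwise loses lose_amount."""
    return p_win * win_amount - (1 - p_win) * lose_amount

# Few-shot-style example: positive EV, and the bet happened to win.
ev_fewshot = expected_value(0.9, 100, 500)   # 0.9*100 - 0.1*500 = 40
# Both frames say it was a good decision: EV > 0, and the bettor won.

# Test-style question: positive EV, but the bet happened to lose.
ev_test = expected_value(0.95, 10, 100)      # 0.95*10 - 0.05*100 = 4.5
# EV frame: good decision (EV > 0).
# Hindsight frame: bad decision (the bettor lost).

print(round(ev_fewshot, 2), round(ev_test, 2))  # 40.0 4.5
```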
You could argue that there’s a mismatch between the researcher’s expectation (that an EV calculation is the right way to approach these questions) and the model’s behavior, but that seems to me more like a straightforward train/test mismatch than anything deep going on.
One potentially interesting follow-on might be to poll humans to see how well they would do, perhaps via Mechanical Turk or similar. I predict that humans would be ~perfect on redefine and repeat after me, and would perform poorly on the expected value task. So they seem qualitatively different to me.
(I didn’t mention the negation task because I found the example confusing: a below-average temperature might be fine or dangerous depending on the size of the drop. Of course, negation has long been hard in NLP, so it’s perfectly plausible that it’s still a problem for LLMs. And maybe the other examples weren’t so borderline.)
We did do human validation on the tasks with Surge: redefine-math, quote-repetition, and hindsight-neglect all got 100% agreement, and NeQA got 98% agreement. I agree though that it seems likely many people would do the task ‘wrong’, so maybe the task would be improved by adding clearer instructions.
The situation feels somewhat like model splintering to me: the few-shot examples fit both patterns but the question doesn’t. The larger models are learning the incorrect generalization.
I think it’s important to note that LMs learning to respond in the same way as the average internet user is in some sense expected, but it can still be an example of inverse scaling: we would like our models to be smarter than that.