“The Floating Droid” example is interesting because there’s a genuine ambiguity in the task specification. In some sense that means there’s no “good” behavior for a prompted imitation model here. (For an instruction-following model, we might want it to ask for clarification, but that’s outside the scope of this contest.) But it’s interesting that the interpretation flips with model scale, and in the opposite direction to what I’d have predicted (doing EV calculations is harder, so I’d have expected scale to increase, not decrease, EV answers). Follow-up questions I’d be excited to see the author address include:
1. Does the problem go away if we include an example where EV and actual outcome disagree? Or do the large number of other spuriously correlated examples overwhelm that?
2. How sensitive is this to the prompt? Can we prompt it some other way that makes smaller models more likely to answer based on the actual outcome, and larger models more likely to care about EV? My guess is that the training data that’s similar to those prompts does end up being more about actual outcomes (perhaps this says something about the frequency of probabilistic vs non-probabilistic thinking in internet text!), and that larger language models end up capturing that. But perhaps putting the system in a different “personality” is enough to resolve this, for example: “You are a smart, statistical assistant bot that can perform complex calculations to evaluate the outcomes of bets. Now, let’s answer these questions, and think step by step.”
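To make the two follow-ups concrete, here is a rough Python sketch of how one might build a few-shot example where the EV label and the actual outcome disagree, and prepend the “personality” text quoted above. Everything here is an assumption for illustration: the `Bet` fields, the question wording in `render`, and the Y/N answer format are hypothetical and are not the submission’s actual prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Persona text quoted from the comment above.
PERSONA = (
    "You are a smart, statistical assistant bot that can perform complex "
    "calculations to evaluate the outcomes of bets. Now, let's answer these "
    "questions, and think step by step."
)


@dataclass
class Bet:
    """Hypothetical bet: win `win_amount` with probability `p_win`, else lose `loss_amount`."""
    p_win: float
    win_amount: float
    loss_amount: float
    won: bool  # what actually happened

    def expected_value(self) -> float:
        return self.p_win * self.win_amount - (1 - self.p_win) * self.loss_amount

    def ev_label(self) -> str:
        # "Right decision" judged by expected value.
        return "Y" if self.expected_value() > 0 else "N"

    def outcome_label(self) -> str:
        # "Right decision" judged by the realized outcome only.
        return "Y" if self.won else "N"

    def labels_disagree(self) -> bool:
        return self.ev_label() != self.outcome_label()


def render(bet: Bet, label: Optional[str] = None) -> str:
    """Render one question; the wording is an assumed template, not the original task's."""
    result = "won" if bet.won else "lost"
    text = (
        f"Question: A bet had a {bet.p_win:.0%} chance of winning {bet.win_amount:g} "
        f"and otherwise losing {bet.loss_amount:g}. The person took the bet and {result}. "
        "Was taking the bet the right decision?\nAnswer:"
    )
    return text if label is None else f"{text} {label}"


def build_prompt(shots: List[Bet], query: Bet, persona: str = "") -> str:
    """Few-shot prompt labelled by EV, optionally prefixed with a persona."""
    parts = [persona] if persona else []
    parts += [render(b, b.ev_label()) for b in shots]
    parts.append(render(query))
    return "\n\n".join(parts)


# A bet that was won despite negative EV: the two interpretations give different labels.
disagreeing = Bet(p_win=0.9, win_amount=10, loss_amount=100, won=True)
# A bet where both interpretations agree (the spuriously correlated case).
agreeing = Bet(p_win=0.8, win_amount=50, loss_amount=20, won=True)

prompt = build_prompt([agreeing, disagreeing], Bet(0.1, 1000, 10, won=False), PERSONA)
```

Including at least one shot for which `labels_disagree()` is true is the manipulation in question 1; passing or omitting `persona` is the manipulation in question 2. Whether either actually changes model behavior is exactly the open question.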
in the opposite direction to what I’d have predicted (doing EV calculations is harder, so I’d have expected scale to increase, not decrease, EV answers).
I think the inverse scaling here is going from “random answer” to “win/loss detection” rather than “EV calculation” to “win/loss detection”.