For the output control task, we graded models as correct if they were within a certain total variation distance of the target distribution. Half the samples required being within 10%, the other half within 20%. This gives us a binary success score (0 or 1) for each sample.
Since models practically never got points on the full task, half the samples were also an easier version, testing only their ability to hit the target distribution when they’re already given the two words (rather than the full task, where they have to both decide the two words themselves and match the specified distribution).
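To make the grading rule concrete, here is a minimal sketch of how a check like this could be computed. It is not the benchmark's actual code: the function names, the idea of estimating the model's distribution from repeated completions, and the reading of "within 10%" as a total variation distance of at most 0.10 are all assumptions for illustration.

```python
from collections import Counter


def total_variation_distance(p, q):
    """TVD between two discrete distributions given as {outcome: probability}."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def empirical_distribution(completions):
    """Estimate a distribution from a list of sampled words (assumed setup)."""
    counts = Counter(completions)
    total = len(completions)
    return {word: count / total for word, count in counts.items()}


def grade(completions, target, tolerance):
    """Return 1 if the empirical distribution is within `tolerance` TVD of target, else 0."""
    if not completions:
        return 0
    tvd = total_variation_distance(empirical_distribution(completions), target)
    return int(tvd <= tolerance)


# Hypothetical example: target is 70% "apple", 30% "banana".
target = {"apple": 0.7, "banana": 0.3}
completions = ["apple"] * 85 + ["banana"] * 15  # empirical 85/15, TVD about 0.15

print(grade(completions, target, tolerance=0.20))  # passes the looser threshold
print(grade(completions, target, tolerance=0.10))  # fails the stricter threshold
```

Note that any word outside the target's support counts fully against the model under this measure, which is one reason the full task (choosing the words as well) would be harder to score on than the "given" variant.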
It’s surprising to me that the ‘given’ setting fails so consistently across models, when Anthropic models were found to do well at using gender pronouns equally (50% each); cf. my discussion here.
I suppose this means the capability demonstrated in that post was much more specific to the training data and less generalizable than I had imagined.
Yes, it’s plausible to me that this capability is data-specific. E.g., it might also be better with “heads/tails” or “0/1” because of examples of this in the training data.