There’s an X thread showing that, in several cases, the ordering of the answer options is a stronger determinant of the model’s answer than its preferences. While this doesn’t invalidate the paper’s results (they control for this by varying the order of the answer options and aggregating the results), it strikes me as evidence for the “you are not measuring what you think you are measuring” argument: the preferences are relatively weak at best and completely dominated by confounding heuristics at worst.
This doesn’t contradict the Thurstonian model at all. It only shows that order effects are one of the many factors contributing to utility variance, which is itself a component of the Thurstonian model. Why should it be treated differently from any other such factor? The calculations still show that utility variance (including order effects) decreases with scale (Figure 12); you don’t need to eyeball a few examples of a single factor from a Twitter thread.
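For concreteness, here is a minimal sketch of how an order effect can slot into a Thurstonian pairwise-choice model. All numbers (the mean utilities, the variance, and the positional bias) are made up for illustration, not taken from the paper: the bias shifts each single-order choice probability, but averaging over the two presentation orders cancels it and leaves the underlying preference visible.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Hypothetical Thurstonian parameters for two outcomes A and B
mu_a, mu_b = 1.0, 0.7   # mean utilities (A is genuinely preferred)
sigma2 = 0.5            # per-item utility variance
order_bias = 0.8        # bonus the model gives to whichever option is listed first

def p_first_chosen(mu_first, mu_second):
    # The order bias enters as a mean shift toward the first-listed option;
    # choice probability follows the Thurstonian comparison of noisy utilities.
    return phi((mu_first + order_bias - mu_second) / sqrt(2 * sigma2))

# Probability that A is chosen under each presentation order
p_a_first = p_first_chosen(mu_a, mu_b)       # A listed first: bias helps A
p_a_second = 1 - p_first_chosen(mu_b, mu_a)  # A listed second: bias helps B

# Single orderings are dominated by the bias (the answer flips with position),
# but aggregating both orders cancels it, and the preference mu_a > mu_b shows.
p_a_aggregated = (p_a_first + p_a_second) / 2
print(p_a_first, p_a_second, p_a_aggregated)
```

With these made-up numbers, each single ordering mostly tracks position, yet the aggregate still lands above 0.5 in favor of A, which is exactly what varying the order and averaging is meant to recover.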
Hey, first author here. We responded to the X thread above, and we added an appendix to the paper (Appendix G) explaining how the ordering effects are not an issue, but rather a way that some models represent indifference.
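A toy illustration of that point (the deterministic tie-breaking rule here is hypothetical, not the paper's data): a model that is genuinely indifferent but always picks the first-listed option looks completely order-determined on any single presentation, yet aggregating over randomized orderings recovers a choice rate near 0.5, i.e., indifference.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical tie-breaking heuristic: with no real preference between the
# options, the model answers with whichever one is listed first.
def choose_first_listed(options):
    return options[0]

trials = 1000
a_count = 0
for _ in range(trials):
    order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]  # randomize ordering
    a_count += choose_first_listed(order) == "A"

# Each single ordering is 100% position-driven, but the aggregate choice
# rate is ~0.5, which a preference fit reads as indifference between A and B.
print(a_count / trials)
```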
I had that vibe from the abstract, but I can try to guess at a specific hypothesis that also explains their data: Instead of a model developing preferences as it grows up, it models an Assistant character’s preferences from the start, but their elicitation techniques work better on larger models; for small models they produce lots of noise.
This interpretation is straightforwardly refuted (insofar as it makes any positivist sense) by the fact that the success of the parametric approach in “Internal Utility Representations” is also correlated with model size.
This does go in the direction of refuting it, but they’d still need to argue that linear probes improve with scale faster than they do for other queries; a larger model means there are more possible linear probes to pick the best from.
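The selection-effect worry can be made concrete with a toy sketch (all sizes hypothetical, and least-squares probing stands in for whatever probe the paper fits): fit a linear probe to pure-noise "activations" and watch in-sample accuracy rise with activation width, reaching 100% once the width matches the number of training examples. Probe results on real models have to improve with scale faster than a baseline like this before they say anything about better representations.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50                                        # fixed number of probe-training examples
labels = rng.integers(0, 2, n).astype(float)  # random targets: no signal to find

def probe_train_accuracy(d):
    """Fit a least-squares linear probe on pure-noise activations of width d."""
    X = rng.standard_normal((n, d))           # activations unrelated to the labels
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    preds = (X @ w > 0.5).astype(float)
    return (preds == labels).mean()

# In-sample fit on noise improves with width; at d = n the probe is perfect.
accs = {d: probe_train_accuracy(d) for d in (2, 10, 50)}
print(accs)
```

This is just the "more possible linear probes to pick the best from" effect in its simplest form: more dimensions means more freedom to fit the training labels, signal or not.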
I don’t see why it should improve faster. It’s generally held that larger models are more interpretable because they have better representations (that’s why we prefer larger models in the first place); why should the scaling be any different for normative representations?
Why?