To me as a programmer and not a mathematician, the distinction doesn’t make practical, intuitive sense.
If we can create three functions f, g, and h that “do the same thing”, e.g. `f(a, b, c) == g(a)(b)(c) == average(h(a), h(b), h(c))`, it seems to me that cross-entropy can “do the same thing” as some particular objective function that explicitly mentions multiple future tokens.
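To make that concrete, here is a toy sketch of what I mean by “do the same thing” (the specific formula is just something I made up for illustration; only the names f, g, h, and average come from the analogy above):

```python
# Toy illustration only: the concrete formula is hypothetical, chosen just to
# show three differently-shaped functions computing the same value.

def f(a, b, c):
    """Takes all arguments at once."""
    return (a**2 + b**2 + c**2) / 3

def g(a):
    """Curried form: one argument at a time."""
    return lambda b: lambda c: (a**2 + b**2 + c**2) / 3

def h(x):
    """Per-element transform, aggregated afterwards."""
    return x**2

def average(*xs):
    return sum(xs) / len(xs)

# All three "shapes" agree on the result.
assert f(1, 2, 3) == g(1)(2)(3) == average(h(1), h(2), h(3))
```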
My intuition is that cross-entropy-powered “local accuracy” can approximate “global accuracy” well enough in practice that I should expect better global reasoning from larger model sizes, faster compute, algorithmic improvements, and better data.
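The piece of math my intuition leans on (if I’m reading it right) is that the per-token cross-entropy losses over a sequence sum to the negative log-likelihood of the whole sequence, via the chain rule of probability:

$$
-\log p_\theta(x_1, \dots, x_T) \;=\; -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$

So, at least in terms of the training objective, optimizing the “local” next-token loss at every position is the same as maximizing the probability the model assigns to entire sequences.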
Implications of this intuition might be:
- Myopia is a quantity, not a quality: a model can be incentivized to be more or less myopic, but I don’t expect it will be proven possible to enforce it “in the limit”.
- Instruct training on longer conversations ought to produce “better” overall conversations if the model simulates that it’s “in the middle” of a conversation, where asking follow-up questions is better than giving a final answer “when close to the end of this kind of conversation”.
What nuance should I consider to understand the distinction better?