To me as a programmer and not a mathematician, the distinction doesn’t make practical, intuitive sense.
If we can create three functions f, g, and h that “do the same thing”, e.g. `f(a, b, c) == g(a)(b)(c) == average(h(a), h(b), h(c))`, it seems to me that cross-entropy can “do the same thing” as some particular objective function that explicitly mentions multiple future tokens.
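To make that concrete, here is a toy sketch of what I mean by “do the same thing” (the specific formula is just something I made up for illustration; only the names f, g, h, and average come from the analogy above):

```python
# Toy illustration only: the concrete formula is hypothetical, chosen just to
# show three differently-shaped functions computing the same value.

def f(a, b, c):
    """Takes all arguments at once."""
    return (a**2 + b**2 + c**2) / 3

def g(a):
    """Curried form: one argument at a time."""
    return lambda b: lambda c: (a**2 + b**2 + c**2) / 3

def h(x):
    """Per-element transform, aggregated afterwards."""
    return x**2

def average(*xs):
    return sum(xs) / len(xs)

# All three "shapes" agree on the result.
assert f(1, 2, 3) == g(1)(2)(3) == average(h(1), h(2), h(3))
```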
My intuition is that cross-entropy-powered “local accuracy” can approximate “global accuracy” well enough in practice that I should expect better global reasoning from larger model sizes, faster compute, algorithmic improvements, and better data.
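The piece of math my intuition leans on (if I’m reading it right) is that the per-token cross-entropy losses over a sequence sum to the negative log-likelihood of the whole sequence, via the chain rule of probability:

$$
-\log p_\theta(x_1, \dots, x_T) \;=\; -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
$$

So, at least in terms of the training objective, optimizing the “local” next-token loss at every position is the same as maximizing the probability the model assigns to entire sequences.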
Implications of this intuition might be:
- Myopia is a quantity, not a quality: a model can be incentivized to be more or less myopic, but I don’t expect it will be proven possible to enforce it “in the limit”.
- Instruct training on longer conversations ought to produce “better” overall conversations if the model simulates that it’s “in the middle” of a conversation, where asking follow-up questions is better than giving a final answer “when close to the end of this kind of conversation”.
What nuance should I consider to understand the distinction better?