I’m surprised that there hasn’t been more of a shift to ternary weights a la BitNet 1.58.
What stood out to me in that paper were the perplexity gains over fp weights in equal-parameter match-ups, and especially the growth in that advantage as parameter counts increased (though only up to quite small model sizes in that paper, which makes me curious about the potential delta at modern SotA scales).
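For concreteness, here is a minimal sketch of how ternary quantization-aware training roughly works, assuming the absmean-style round-and-clip scheme described for BitNet b1.58; the TernaryLinear module and ternary_quantize helper are illustrative names of my own, not the paper's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Absmean ternary quantization (a sketch of the BitNet b1.58-style scheme):
    scale by the mean absolute weight, round and clip to {-1, 0, +1}, then
    rescale so the matmul stays roughly in the original range."""
    scale = w.abs().mean().clamp(min=eps)
    return (w / scale).round().clamp(-1, 1) * scale

class TernaryLinear(nn.Module):
    """Linear layer that keeps full-precision 'latent' weights for the optimizer
    but uses ternary weights in the forward pass (straight-through estimator)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        # Straight-through estimator: the forward pass sees quantized weights,
        # while gradients flow back to the full-precision latent weights.
        w_q = w + (ternary_quantize(w) - w).detach()
        return F.linear(x, w_q)

# Quick sanity check
layer = TernaryLinear(16, 8)
out = layer(torch.randn(4, 16))
out.sum().backward()
print(layer.weight.grad.shape)  # gradients reach the latent fp weights
```

The key point for the argument below is that each weight is forced to commit to one of three directions (positive, negative, or no-op) during training, rather than settling on an intermediate floating-point value.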
This makes complete sense from the standpoint of the superposition hypothesis (irrespective of its dimensionality, an ongoing discussion).
If nodes serve more than one role in a network, then constraining each weight to a ternary value rather than a floating-point range seems like it would more often force the network to restructure overlapping node usage so that nodes align with shared directional shifts (positive, negative, or no-op), instead of compromising across multiple roles with a floating-point average of the individual role changes.
(Essentially resulting in a sharper rather than a fuzzier network mapping.)
Much of the attention around the paper focused on the efficiency gains from the smaller memory footprint, but it really seems like, even if there were no efficiency gains at all, models pretrained from this point onward should seriously consider clamping weight precision, both to improve overall network performance and, likely, to make interpretability more tractable down the road to boot.
It may be that at the scales we are already at, the main offering of such an approach would be the perplexity advantage over fp weights, with the memory savings as a beneficial side effect instead?