Is this a straight sum, where negative actvals cancel with positive ones? If so, would summing the absolute values of the activations instead be more indicative of “distributed effort”? Or, if only positive actvals affect downstream activations, maybe a better metric for “effort” would be to sum only the positive ones?
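To make those three candidates concrete, here’s a minimal sketch (PyTorch; the tensor shape and variable names are just mine for illustration, not anything from the post):

```python
import torch

# Hypothetical per-token activations, e.g. one row per layer: (n_layers, d_model).
acts = torch.randn(12, 768)

signed_sum = acts.sum()               # straight sum: negatives cancel positives
abs_sum    = acts.abs().sum()         # magnitude regardless of sign ("distributed effort"?)
pos_sum    = acts.clamp(min=0).sum()  # count only the positive activations
```

Whether any of these actually tracks “effort” is exactly what I’m unsure about.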
I’m not sure whether total actvals for a token is a good measure of the “effort” it takes to process it. Maybe. In brains, the salience (somewhat analogous to actvals) of an input is definitely related to how much effort it takes to process it (as measured by the number of downstream neurons it affects[1]), but I don’t know enough about transformers yet to judge if and how that analogises.
There’s a salience threshold at which an input enters “consciousness”: it then gets processed in a loop for a while, affecting a much larger portion of the network than inputs that don’t reach the threshold.
Another way transformers differ: every tensor operation involves the same number of cells and bits, so the computational resources spent per token processed are constant; unless I’m mistaken?
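A rough back-of-the-envelope of what I mean, using the usual FLOP approximations for a decoder layer (ignoring layernorm, softmax, and embeddings; the formulas are standard estimates, not from the post):

```python
def flops_per_token(d_model: int, n_layers: int, ctx_len: int) -> int:
    """Approximate FLOPs to push one token through a decoder stack.

    The count depends only on architecture shapes and on how much
    context the token attends over -- never on the token's content.
    """
    attn_proj = 4 * d_model ** 2          # Q, K, V and output projections (MACs)
    attn_mix  = 2 * d_model * ctx_len     # attention scores + value mixing (MACs)
    mlp       = 8 * d_model ** 2          # two linear layers with 4x expansion (MACs)
    return n_layers * (attn_proj + attn_mix + mlp) * 2  # 1 MAC = 2 FLOPs

print(flops_per_token(d_model=768, n_layers=12, ctx_len=1024))
```

(The count does grow with context length, so “constant” only holds per position; but it never varies with which token it is.)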
I’m doing a follow-up on this one, and apparently the computations I did were misleading. But further reviewing the results led me to another accidental discovery.