So if I understand correctly, you are saying:
A ternary-weighted LLM with accuracy comparable to Chinchilla (70B float weights) would need significantly more (dense) trits, let’s say >140B?
An LLM with significantly more trit weights is less interpretable than an LLM with a smaller number of float weights?
Do you disagree regarding harm if successful?
Consider that most of the trits will be 0 and thus removable, and that we will be replacing the activations with Boolean logic and applying logic-simplification transformations to discard even more nodes. The number of trits in the weights is not the same as the number of gates in the resulting logic graph. I think it plausible that even if we are forced to start with an LLM of greater than Chinchilla size to achieve comparable accuracy, after sparsification and logic simplification we will end up with significantly fewer gates. Would such an LLM still be less interpretable?
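To make the trits-versus-gates distinction concrete, here is a toy sketch (my own illustration, not the actual pipeline): a single ternary-weighted neuron over binary inputs, where the zero weights are dropped before any logic is built, and sympy’s SOPform stands in for the logic-simplification pass. The weights and threshold are made up.

```python
# Toy sketch: one ternary-weighted neuron with binary inputs, converted to a
# simplified Boolean expression. Weights and threshold are invented; SOPform
# stands in for whatever logic-simplification pass the real tooling would use.
import itertools
import numpy as np
from sympy import symbols
from sympy.logic import SOPform

w = np.array([1, 0, -1, 0, 0, 1, 0, 0])   # ternary weights, mostly zero
threshold = 1                              # the neuron "fires" when the signed sum >= 1

active = np.nonzero(w)[0]                  # sparsification: only nonzero trits matter
xs = symbols(f"x0:{len(active)}")          # one Boolean variable per surviving input

# Enumerate the truth table over the surviving inputs, collecting the input
# patterns (minterms) for which the neuron fires.
minterms = [list(bits)
            for bits in itertools.product([0, 1], repeat=len(active))
            if int(np.dot(w[active], bits)) >= threshold]

expr = SOPform(xs, minterms)               # minimized sum-of-products expression
print(f"{len(active)} of {len(w)} trits survive; neuron = {expr}")
```

Eight trits go in, only the three nonzero ones survive, and the minimized expression needs a handful of gates.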
If you want to be competitive with SOTA, a more quantized net will need a lot more neurons (have you read the new article on superposition?).
I agree that lower-precision weights will likely require somewhat more weights; however, I do not see the connection to superposition. It is possible to embed >n features in n bits (assuming some feature sparsity). The features will be on the corners of the unit hypercube, but most of the volume is out near the corners anyway, so I do not think it would be a very large decrease in available space.
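As a toy illustration of that claim (my sketch, not anything from the post): assign random ±1 codes to more features than there are dimensions, keep only a few features active at once, and read each feature back by correlating its code with the binarized superposition. The sizes and the sign-snap readout are arbitrary choices for the demo.

```python
# Toy sketch: 256 features embedded in 64 binary dimensions via random +/-1
# codes, with only 3 features active at a time. All sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_feats, k_active = 64, 256, 3

codes = rng.choice([-1, 1], size=(n_feats, n_dims))   # one hypercube-corner code per feature
active = rng.choice(n_feats, size=k_active, replace=False)

# Superpose the active features, then snap back onto a corner of the cube.
# (The sum of an odd number of +/-1 entries is never zero, so sign() is safe.)
state = np.sign(codes[active].sum(axis=0))

scores = codes @ state / n_dims                       # correlate every code with the state
recovered = np.argsort(scores)[-k_active:]
print(sorted(active.tolist()), sorted(recovered.tolist()))
```

With activity this sparse, the three largest correlations usually pick out exactly the active features, even though the state is just one corner of the 64-dimensional cube.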
and that you would still need specialized tools to get anywhere.
I agree with this. I am currently attempting to build the needed tooling. It’s nontrivial work, but I think it is doable.
If you want to be competitive with SOTA, a more quantized net will need a lot more neurons (have you read the new article on superposition?).
I agree that lower-precision weights will likely require somewhat more weights; however, I do not see the connection to superposition. It is possible to embed >n features in n bits (assuming some feature sparsity). The features will be on the corners of the unit hypercube, but most of the volume is out near the corners anyway, so I do not think it would be a very large decrease in available space.
The more quantized the weights and activations, the harder it is to embed >n features in n bits without them interfering with each other—interference that stops you from adding together features in semantically sensible ways, or decomposing a state into features. So those small bits aren’t just being wasted—at least I think not, in most parts of modern NNs.
I agree; I think you would need a LOT more weights. Kind of a ridiculous-seeming amount, perhaps, like maybe 10,000x or more. But I actually think that’s a potential strength. I think that reducing superposition and having a very sparse, wide network with only a small portion of that network active at any one time could actually be made both compute-efficient and interpretable. If each of those sparse weights does fewer things, then it becomes much easier to label those specific things, and to see what logic went into any given decision.
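A minimal numpy sketch of that very-sparse, very-wide picture (all sizes invented for illustration): a k-winners-take-all layer in which only a few dozen units fire on a given input, so only those units would need labels to explain that forward pass.

```python
# Hypothetical sketch of a sparse, wide layer: keep only the k most active
# units and zero the rest. Width, k, and dimensions are made up for the demo.
import numpy as np

rng = np.random.default_rng(0)
width, d_in, k = 16_384, 512, 32           # roughly 0.2% of units active per input

x = rng.standard_normal(d_in)
W = rng.standard_normal((width, d_in)) / np.sqrt(d_in)

pre = W @ x
winners = np.argpartition(pre, -k)[-k:]    # indices of the k largest pre-activations
h = np.zeros(width)
h[winners] = pre[winners]                  # every other unit is exactly zero

print(f"{np.count_nonzero(h)} of {width} units active on this input")
```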
As for whether it’s computationally tractable… there’s good reason to think it is. The brain is basically a very wide, sparse net that’s quite computationally efficient. Here’s a recent interview from Yannic Kilcher on the subject:
My view is slightly different, in that I don’t think we should prune down the networks and leave them pruned. I think we want absurdly huge networks with clear labels. I’m currently imagining something that’s like a mixture of experts implemented in this giant wide network, but the experts have significant overlap with each other. So maybe we create this with a series of learn-prune-learn-prune-learn cycles to build up an increasingly complex, very sparse space.
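One way such a learn-prune cycle could look (a sketch only, assuming plain magnitude pruning and a toy regression problem; none of these details are from the thread): alternate gradient steps with pruning, carrying a persistent mask so pruned weights stay at zero while the survivors keep learning.

```python
# Toy learn-prune-learn-prune loop on a linear regression problem. The data,
# learning rate, and 70%-keep schedule are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 32))
true_w = rng.standard_normal(32) * (rng.random(32) < 0.25)   # sparse ground truth
y = X @ true_w

w = 0.1 * rng.standard_normal(32)
mask = np.ones(32)

for cycle in range(4):
    for _ in range(200):                          # "learn": plain gradient descent
        grad = X.T @ (X @ (w * mask) - y) / len(X)
        w = (w - 0.05 * grad) * mask              # pruned weights stay at zero
    keep = int(32 * 0.7 ** (cycle + 1))           # "prune": keep a shrinking top fraction
    mask[np.argsort(np.abs(w))[: 32 - keep]] = 0  # drop the smallest-magnitude weights
    w *= mask

print(f"{int(mask.sum())} weights survive; fit error = {np.linalg.norm(X @ w - y):.3f}")
```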
If we can get the unwanted cognition/behaviors to sit entirely in their own section of weights, we can then ablate the unwanted behaviors without losing wanted capability. That’s my hope anyway.
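In the simplest case, that ablation could look like the following (purely illustrative; the labeled block of units is hypothetical): zero only the weights belonging to the units tagged as carrying the unwanted behavior, and leave everything else untouched.

```python
# Minimal ablation sketch: zero the rows of a weight matrix belonging to a
# (hypothetical) labeled group of units, leaving the rest of the network alone.
import numpy as np

W = np.random.default_rng(0).standard_normal((1024, 512))
unwanted_units = np.arange(900, 930)       # pretend these 30 units were labeled "unwanted"

W_ablated = W.copy()
W_ablated[unwanted_units, :] = 0.0         # ablate only that section of weights
print(f"zeroed {len(unwanted_units)} of {W.shape[0]} units; the rest are untouched")
```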
I agree that reducing superposition is probably valuable even if it requires a significantly larger network. I still don’t understand why the transition from float to binary would cause a dramatic reduction in superposition capacity. But if it does prevent superposition, great! I’ll just give it more parameters as needed. But if we still get superposition, I will need to apply other techniques to make it stop.
(I have not yet finished my closer re-read of Toy Models of Superposition after my initial skimming. Perhaps once I do I will understand better.)
Hopefully in a few months I will have empirical data regarding how many more neurons we need. Then I can stop hand-waving about vague intuitions.
If we can get the unwanted cognition/behaviors to sit entirely in their own section of weights, we can then ablate the unwanted behaviors without losing wanted capability. That’s my hope anyway.
I’m glad we agree that RNNs are nice.
My thoughts and hope as well.