Human brains are most likely undertrained on text data.
Going by the scaling laws from DeepMind’s Chinchilla work, a system with as many parameters and as much compute as the brain should be trained on vastly more text than any human can read in their lifetime. It therefore seems plausible that “lack of training data” is a significant bottleneck on human cognitive capabilities.
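A rough back-of-the-envelope version of that comparison, as a sketch only. The figure of roughly 20 training tokens per parameter is the commonly quoted approximation of the Chinchilla result. Treating the brain’s roughly 10^14 synapses as “parameters”, and all of the reading-rate numbers, are illustrative assumptions rather than measured values.

```python
# Commonly quoted approximation of the Chinchilla result:
# about 20 training tokens per model parameter.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float) -> float:
    """Tokens a compute-optimal model of this size would be trained on."""
    return TOKENS_PER_PARAM * n_params


def lifetime_reading_tokens(
    words_per_minute: float = 250.0,  # assumed reading speed
    hours_per_day: float = 2.0,       # assumed daily reading time
    years: float = 60.0,              # assumed reading lifetime
    tokens_per_word: float = 1.3,     # assumed tokenizer ratio
) -> float:
    """Very rough count of the text tokens one person reads in a lifetime."""
    minutes_read = hours_per_day * 60 * 365 * years
    return words_per_minute * minutes_read * tokens_per_word


# Assumption: count each synapse as one parameter (order of magnitude only).
BRAIN_PARAMS = 1e14

optimal = compute_optimal_tokens(BRAIN_PARAMS)
lifetime = lifetime_reading_tokens()
shortfall = optimal / lifetime

print(f"compute-optimal tokens: {optimal:.1e}")
print(f"lifetime reading tokens: {lifetime:.1e}")
print(f"shortfall factor: {shortfall:.1e}")
```

Under these assumptions the gap is several orders of magnitude, so the qualitative conclusion does not hinge on the exact reading speed or synapse count chosen.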
So the question is: how do we best increase the efficiency of our own text training process? There are two dimensions to this question:
What text should we include?
High-quality training text that is related to the downstream domain of interest is best. Finding such text is often the bottleneck in consuming it, so ideally we’d compile a large corpus of excellent-quality human pretraining text.
One option might be to train a “relevant text classifier”, which identifies well-written text about domains that are useful for alignment, such as alignment research, ML, math, neuroscience, biology, game theory, etc. Then, use that classifier to scour the internet and all journals / books / etc. for useful text and compile the results.
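As a toy illustration of what a “relevant text classifier” does, here is a minimal bag-of-words naive Bayes sketch. It is a deliberately simple stand-in: a serious version would presumably fine-tune a language model on many labelled examples. The class name, the training snippets and the labels are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesRelevance:
    """Toy relevant / not-relevant text classifier (hypothetical name)."""

    def __init__(self) -> None:
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab: set[str] = set()

    def fit(self, docs: list[str], labels: list[bool]) -> "NaiveBayesRelevance":
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, doc: str, label: bool) -> float:
        """Log prior plus Laplace-smoothed log likelihood of the known words."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for tok in tokenize(doc):
            if tok in self.vocab:  # ignore words never seen in training
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def is_relevant(self, doc: str) -> bool:
        return self.log_score(doc, True) > self.log_score(doc, False)


# Made-up training data, for illustration only.
docs = [
    "gradient descent trains neural networks",
    "reward models and alignment research",
    "game theory of cooperating agents",
    "celebrity gossip and fashion news",
    "best pizza recipes for the weekend",
    "football scores from last night",
]
labels = [True, True, True, False, False, False]

clf = NaiveBayesRelevance().fit(docs, labels)
print(clf.is_relevant("alignment research on neural networks"))
print(clf.is_relevant("pizza recipes and football"))
```

The same `is_relevant` interface could then be applied to each candidate document found while crawling, keeping only the ones flagged as relevant.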
How should we “train” on as much text as possible?
One simple option is to just read more, but reading is slow. There’s only so much time in the day, and reading takes time away from other activities.
Another option is to convert the text into audio form using text-to-speech. This makes it vastly more convenient to take in large quantities of text, but has other issues such as:
Images are unavailable.
Pausing or replaying past text is often inconvenient.
Math and LaTeX are almost never captured well.
The third option, and the one I believe would be most powerful / scalable, is to use a multi-modal pretrained model to convert the text + images + math into latent representations, then to feed those latents (or dimension-reduced versions of those latents) to the human via one or more of their sensory channels.
What I mean by that is to have some way of translating the latent representation of the text into sensory input for the human, e.g., into an audio signal played into the human’s ears, or into vibrations which a system such as the Eyeronman vest delivers as tactile sensations.
Doing so would require training the human to decode these sensory representations, but that should be manageable. We’d show the human side-by-side instances of the original text / images / math along with the compressed sensory representations.
This approach allows for a greater throughput of information into the human, while avoiding the expense, risk and technical difficulties associated with brain-computer interfaces.
It also allows the human to take advantage of the preprocessing provided by the pretrained transformer, meaning that the effective compute available to the human increases as well.
It may also lead to a form of “knowledge distillation” from the pretrained transformer to the human.
Knowledge distillation is an approach in machine learning where the latent knowledge contained in a larger, more powerful ML model is “distilled” into a smaller model. The typical approach is for the smaller model to be trained to imitate the latent representations of the larger model on some reduced corpus of training data.
In this case, the human learns to process the latents generated by the model. If those latents contain representations of super-humanly performant abstractions, the human may pick up on such abstractions.
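For readers unfamiliar with distillation, here is a minimal sketch of the “imitate the teacher’s latents” idea in the ML setting. Both models are tiny linear maps and the teacher is a fixed hypothetical function, so this only illustrates the training signal (squared error between student and teacher latents), not a realistic setup where the student is smaller than the teacher.

```python
def teacher_latent(x: list[float]) -> float:
    """Hypothetical frozen teacher: a fixed map from an input to a 1-D latent."""
    return 2.0 * x[0] - 1.0 * x[1]


def student_latent(w: list[float], x: list[float]) -> float:
    """Student with two trainable weights."""
    return w[0] * x[0] + w[1] * x[1]


def distill(inputs: list[list[float]], steps: int = 500, lr: float = 0.1) -> list[float]:
    """Train the student by gradient descent to match the teacher's latents."""
    w = [0.0, 0.0]
    n = len(inputs)
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x in inputs:
            # Error between the student's latent and the teacher's latent.
            err = student_latent(w, x) - teacher_latent(x)
            # Gradient of the mean squared error with respect to each weight.
            grad[0] += 2.0 * err * x[0] / n
            grad[1] += 2.0 * err * x[1] / n
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w


# Made-up "reduced corpus" of inputs on which the student imitates the teacher.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
student_weights = distill(corpus)
print(student_weights)
```

In the human version of this, the “student” update is whatever learning the brain does when repeatedly exposed to the latents alongside the original material.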
This could potentially aid in interpretability as well, if the human in question develops a sense for the model’s internal representations.
We could also reduce the dimensionality of the model’s latent representations or strip away irrelevant information so as to further increase the richness / density of the human’s input data.
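One crude way to picture “stripping away irrelevant information” is to drop latent dimensions that barely change across the corpus. The sketch below keeps only the highest-variance coordinates. This is a simple stand-in chosen for illustration (PCA or a learned bottleneck would be more typical choices), and the example latents are made up.

```python
from statistics import pvariance


def top_variance_dims(latents: list[list[float]], k: int) -> list[int]:
    """Indices of the k latent dimensions that vary most across the corpus."""
    n_dims = len(latents[0])
    variances = [pvariance([vec[d] for vec in latents]) for d in range(n_dims)]
    ranked = sorted(range(n_dims), key=lambda d: variances[d], reverse=True)
    return sorted(ranked[:k])


def reduce_latents(latents: list[list[float]], dims: list[int]) -> list[list[float]]:
    """Keep only the selected dimensions of every latent vector."""
    return [[vec[d] for d in dims] for vec in latents]


# Made-up 4-D latents: dimensions 0 and 3 are constant, so they carry
# no information that distinguishes one input from another.
latents = [
    [0.0, 1.0, 5.0, 0.1],
    [0.0, -1.0, 5.1, 0.1],
    [0.0, 2.0, 4.9, 0.1],
]

dims = top_variance_dims(latents, k=2)
reduced = reduce_latents(latents, dims)
print(dims)
print(reduced)
```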
Overall, the approach that I think would be most effective is:
Collect a corpus of high-quality text, images and math from books and articles that might be alignment-relevant (maybe ~20 GB of text).
Take a multi-modal transformer pretrained on much more data than the corpus (basically, you’d use the best multi-modal model available).
Find some scalable method of translating the model’s latents into information-dense, human-learnable sensory input.
These would initially appear like random noise / images / vibrations to a human, but with exposure, start to make sense as the brain adapts to the new encoding.
Analogously, the Eyeronman vests I mentioned translate 3-D scene representations into vibrations. After enough time with one, people can pick up a sense of what the environment around them is like through the vibrations from the vest.
Translate the corpus into those sensory representations.
Feed them to the human.
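The translation steps above can be sketched as follows, under heavy assumptions. `toy_embed` is a hypothetical placeholder for a real pretrained model’s latents (it is not a meaningful embedding), and the mapping from latents to vibration-motor intensities (average contiguous chunks of the latent, then squash into the range 0 to 1) is just one arbitrary choice. Finding a mapping that humans can actually learn is the open problem.

```python
import math
from typing import Callable


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for a pretrained model's latent vector."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += (ord(ch) % 32 - 16) / 16.0
    return vec


def latent_to_motor_intensities(latent: list[float], n_motors: int) -> list[float]:
    """Map a latent vector to one intensity in (0, 1) per vibration motor."""
    if not 0 < n_motors <= len(latent):
        raise ValueError("need between 1 and len(latent) motors")
    d = len(latent)
    intensities = []
    for m in range(n_motors):
        # Each motor gets a contiguous chunk of the latent vector.
        chunk = latent[m * d // n_motors:(m + 1) * d // n_motors]
        mean = sum(chunk) / len(chunk)
        # Logistic squash so every intensity lands strictly between 0 and 1.
        intensities.append(1.0 / (1.0 + math.exp(-mean)))
    return intensities


def encode_corpus(
    docs: list[str],
    embed: Callable[[str], list[float]],
    n_motors: int,
) -> list[list[float]]:
    """Translate every document into one frame of motor intensities."""
    return [latent_to_motor_intensities(embed(doc), n_motors) for doc in docs]


frames = encode_corpus(["Scaling laws", "Game theory"], toy_embed, n_motors=4)
for frame in frames:
    print([round(v, 2) for v in frame])
```

A real system would emit a sequence of frames per document rather than a single frame, since one latent per document would discard almost all of the content.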
This is a pretty basic setup. Information only flows in one direction, from the model to the human. Most likely, there are ways of improving things by having the model learn to produce latent representations that are more useful for the cognitive tasks the human intends to perform. E.g., the methodology in “Training Language Models with Language Feedback” can be adapted so that the human can provide feedback on what sorts of things the model should focus more / less on.
Note that translating external information into sensory inputs handles the “getting lots of information into the human” problem. The “getting lots of information out of the human” problem isn’t quite so easy to handle. Humans receive more information from their senses than they transmit via their actions, so just watching human actions probably doesn’t offer as high a throughput. Potentially, we can use non-invasive brain imaging tech, which seems to be progressing faster than “read + write” brain-computer interfaces. Having high-throughput input + output channels for the brain would let us properly do the whole “merging with technology” thing and keep up with mildly superhuman AIs[1], for a while at least.
[1] I expect horse-versus-automobile comparisons in response to this point. I think this analogy isn’t actually illuminating here because learning systems can be combined much more easily than physical systems. E.g., DeepMind’s multi-modal Flamingo model took frozen layers from the text-only Chinchilla model and integrated them with a smaller number of trainable parameters for handling the image and image-to-text side of things. Provided you let two learning systems adapt to each other (or even just let one learning system adapt to the other), it’s relatively straightforward to combine them.
The idea of generating and directly transferring a pre-digested latent representation is super interesting, but my prior is that this couldn’t work. The way a neural network trained from randomly initialized weights represents concepts is likely to be highly idiosyncratic to that particular network. Perhaps this could be accomplished between AIs if we can somehow make that process and initial state less random, but how could that ever work for humans?
The highest-bandwidth sensory input for humans is their eyes. Doesn’t this idea just amount to diagrams of high-dimensional data?
It works for AIs very easily. Just feed the patents from AI 1 into AI 2. No need for special engineering of the two AIs.
It also works for humans, at least somewhat. E.g., the Eyeronman vests I mentioned translate 3-D scene representations into vibrations. After enough time with one, people can pick up a sense of what the environment around them is like through the vibrations from the vest.
Translating LLM patents into visual input wouldn’t look like normal diagrams. It would look like a random-seeming mishmash of colors and shapes which encode the LLM’s latents. A person would then be shown many pairs of text and the encoded latents the model generated for the text. In time, I expect the person would gain a “text sense” where they can infer the meaning of the text from just the visual encoding of the model’s latents.
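To make the paired-exposure idea concrete, here is a small sketch that packs a latent vector into RGB pixels and builds a multiple-choice trial asking which text a given encoding came from. The pixel mapping, the trial format and the example latents are all hypothetical choices for illustration, not a tested training protocol.

```python
import random


def latent_to_pixels(latent: list[float]) -> list[tuple[int, ...]]:
    """Pack a latent vector into RGB pixels, three values per pixel."""

    def to_byte(v: float) -> int:
        clipped = max(-1.0, min(1.0, v))      # assume values mostly lie in [-1, 1]
        return int((clipped + 1.0) * 127.5)   # map [-1, 1] onto 0..255

    # Pad with zeros so the length is a multiple of three.
    padded = list(latent) + [0.0] * (-len(latent) % 3)
    return [tuple(to_byte(v) for v in padded[i:i + 3]) for i in range(0, len(padded), 3)]


def make_trial(pairs: list[tuple[str, list[float]]], rng: random.Random, n_choices: int = 3) -> dict:
    """Show one encoding and ask which of several texts produced it."""
    target_text, target_latent = rng.choice(pairs)
    distractors = [text for text, _ in pairs if text != target_text]
    choices = rng.sample(distractors, n_choices - 1) + [target_text]
    rng.shuffle(choices)
    return {
        "pixels": latent_to_pixels(target_latent),
        "choices": choices,
        "answer": choices.index(target_text),
    }


# Made-up (text, latent) pairs; real latents would come from the model.
pairs = [
    ("Entropy measures uncertainty.", [0.9, -0.2, 0.1, 0.4]),
    ("Nash equilibria are stable.", [-0.7, 0.3, 0.8, -0.1]),
    ("Neurons fire in spikes.", [0.0, 0.5, -0.6, 0.2]),
]

trial = make_trial(pairs, random.Random(0))
print(trial["pixels"])
print(trial["choices"], "-> correct index:", trial["answer"])
```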
I think I’m lacking some jargon here. What’s a latent/patent in the context of a large language model? “patent” is ungoogleable if you’re not talking about intellectual property law.
The Eyeronman link didn’t seem very informative. No explanation of how it works. I already knew sensory substitution was a thing, but is this different somehow? Is there some neural net pre-digesting its outputs? Is it similarly a random-seeming mishmash? Are there any other examples of this kind of thing working for humans? Visually?
Would the mishmash from a smaller text model be any easier/faster for the human to learn?
My money’s on: typo.
Oooh, this is very promising! I had a semi-similar idea for images instead of text, basically like this but in reverse.