a transformer is not a generally intelligent agent and won’t be even if scaled many more OOMs
So far as I can tell, a transformer has three possible blockers (that would need to stand undefeated together): (1) in-context learning plateauing at a level where it’s not able to do even a little bit of useful work without changing model weights, (2) terrible sample efficiency that asks for more data than is available on new or rare/situational topics, and (3) absence of a synthetic data generation process that’s both sufficiently prolific and known not to be useless at that scale.
A need for online learning and terrible sample efficiency are defeated by OOMs if enough useful synthetic data can be generated, which the anemic in-context learning without changing weights might turn out to be sufficient for. This is the case of defeating (3), with others falling as a result.
Another possibility is that much larger multimodal transformers (there is a lot of video) might suffice without synthetic data if a model learns superintelligent in-context learning. SSL is not just about imitating humans, the problems it potentially becomes adept at solving are arbitrarily intricate. So even if it can’t grow further and learn substantially new things within its current architecture/model, it might happen to already be far enough along at inference time to do the necessary redesign on its own. This is the case of defeating (1), leaving it to the model to defeat the others. And it should help with (3) even at non-superintelligent levels.
Failing that, RL demonstrates human level sample efficiency in increasingly non-toy settings, promising that saner amounts of useful synthetic data might suffice, defeating (2), though at this point it’s substantially not-a-transformer.
Generating useful synthetic data and solving novel tasks with little correlation with training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I don’t know what superintelligent in-context learning is. I’d be skeptical that scaling a transformer a further 3 OOMs will suddenly make it do tasks that are very far from the text distribution it is trained on, indeed tasks whose solutions are not even remotely present in internet text data, like building a recursively self-improving agent (if such a thing is possible...). Maybe I’m misunderstanding what you’re claiming here.
Not saying it’s impossible, just seems deeply implausible. Ofc LLMs being so impressive was also a priori implausible, but this seems another OOM of implausibility bits, if that makes sense?
Generating useful synthetic data and solving novel tasks with little correlation with training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I’m imagining some prompts to generate reasoning, inferred claims about the world. You can’t generate new observations about the world, but you can reason about the observations available so far, and having those inferred claims in the dataset likely helps; that’s how humans build intuition about theory. If on average 1000 inferred claims are generated for every naturally observed statement (or just for those on rare/new/situational topics), that could close the gap of sample efficiency with humans. Might take the form of exercises or essays or something.
If this is all done with prompts, using a sufficiently smart order-following chatbot, then it’s straightforwardly just a transformer, with some superficial scaffolding. If this can work, it’ll eventually appear in distillation literature, though I’m not sure if serious effort to check was actually made with current SOTA LLMs, to pre-train exclusively on synthetic data that’s not too simplistically prompted. Possibly you get nothing for a GPT-3 level generator, and then something for GPT-4+, because reasoning needs to be good enough to preserve contact with ground truth. From Altman’s comments I get the impression that it’s plausibly the exact thing OpenAI is hoping for.
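To make the "superficial scaffolding" concrete, here is a minimal sketch of what I have in mind, under loud assumptions: `generate` is a hypothetical stand-in for a call to some instruction-following chatbot (stubbed out below, no real API is used), and the prompt templates are made up for illustration. The only numbers taken from the discussion above are the roughly 1000 inferred claims per observed statement.

```python
from typing import Callable, Iterable, List

# Hypothetical prompt templates; the comment above only says
# "exercises or essays or something".
PROMPTS = [
    "State a claim that follows from: {obs}",
    "Write an exercise whose solution relies on: {obs}",
    "Write a short essay paragraph drawing a consequence of: {obs}",
]


def expand_corpus(
    observations: Iterable[str],
    generate: Callable[[str], str],
    claims_per_observation: int = 1000,
) -> List[str]:
    """Return the original observations plus generated inferred claims."""
    corpus: List[str] = []
    for obs in observations:
        corpus.append(obs)  # keep the naturally observed statement itself
        for i in range(claims_per_observation):
            # Cycle through the templates so the generated items vary in form.
            prompt = PROMPTS[i % len(PROMPTS)].format(obs=obs)
            corpus.append(generate(prompt))
    return corpus


def stub_generate(prompt: str) -> str:
    """Placeholder for a chatbot call (assumption); it only echoes the prompt."""
    return f"[model output for: {prompt}]"


observations = ["Observation A", "Observation B", "Observation C"]
corpus = expand_corpus(observations, stub_generate, claims_per_observation=1000)
print(len(corpus))  # 3 observations + 3 * 1000 generated items
```

Nothing here checks whether the generated claims keep contact with ground truth, which is the part that plausibly needs a GPT-4+ level generator; the sketch only shows that the data-multiplication step itself needs nothing beyond prompting.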
I don’t know what superintelligent in-context learning is
In-context learning is the capability to make use of novel data that’s only seen in a context, not in pre-training, to do tasks that make use of this novel data, in ways that normally would’ve been expected to require it being seen in pre-training. In-context learning is a model capability, it’s learned. So its properties are not capped by those of the hardcoded model training algorithm; notably, in principle in-context learning could have higher sample efficiency (which might be crucial for generating a lot of synthetic data out of a few rare observations). Right now it’s worse in most respects, but that could change with scale without substantially modifying the transformer architecture, which is the premise of this thread.
By superintelligent in-context learning I mean the capabilities of in-context learning significantly exceeding those of humans. Things like fully comprehending a new paper without changing any model weights, becoming able to immediately write the next one in the same context window. I agree that it’s not very plausible, and probably can’t happen without sufficiently deep circuits, which even deep networks don’t seem to normally develop. But it’s not really ruled out by anything that’s been tried so far. Recent stuff on essentially pre-training with some frozen weights without losing resulting performance suggests a trend of increasing feasible model size for given compute. So I’m not sure this can’t be done in a few years. Then there’s things like memory transformers, handing a lot more data than a context to a learned learning capability.