Generating useful synthetic data and solving novel tasks with little correlation to the training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I don’t know what superintelligent in-context learning is. I’d be skeptical that scaling a transformer a further 3 OOMs will suddenly make it do tasks that are very far from the text distribution it is trained on, let alone produce solutions to tasks that are not even remotely present in internet text data, like building a recursively self-improving agent (if such a thing is possible...). Maybe I’m misunderstanding what you’re claiming here.
Not saying it’s impossible, just that it seems deeply implausible. Of course, LLMs being this impressive was also a priori implausible, but this seems another OOM of implausibility bits, if that makes sense?
Generating useful synthetic data and solving novel tasks with little correlation to the training data is the exact issue here. Seems straightforwardly true that a transformer architecture doesn’t do that?
I’m imagining some prompts to generate reasoning, inferred claims about the world. You can’t generate new observations about the world, but you can reason about the observations available so far, and having those inferred claims in the dataset likely helps; that’s how humans build intuition about theory. If on average 1,000 inferred claims are generated for every naturally observed statement (or just for those on rare/new/situational topics), that could close the gap in sample efficiency with humans. It might take the form of exercises or essays or something.
If this is all done with prompts, using a sufficiently smart order-following chatbot, then it’s straightforwardly just a transformer with some superficial scaffolding. If this can work, it’ll eventually appear in the distillation literature, though I’m not sure a serious effort to check has actually been made with current SOTA LLMs, i.e. pre-training exclusively on synthetic data that isn’t too simplistically prompted. Possibly you get nothing from a GPT-3-level generator and then something from GPT-4+, because the reasoning needs to be good enough to preserve contact with ground truth. From Altman’s comments I get the impression that this is plausibly the exact thing OpenAI is hoping for.
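To make the scaffolding concrete, here’s a minimal sketch of that kind of prompted expansion, under loud assumptions: generate stands in for a call to whatever order-following chatbot is available, and the prompt wording and the 1,000x expansion factor are just taken from the comment above, not a tested recipe.

```
# Minimal sketch of prompted synthetic-data generation; everything here is
# illustrative (the prompt, the expansion factor, the placeholder chatbot).

def generate(prompt: str) -> str:
    """Placeholder for a call to a sufficiently smart order-following chatbot."""
    raise NotImplementedError

def expand_observation(observation: str, n_claims: int = 1000) -> list[str]:
    """Turn one naturally observed statement into many inferred claims."""
    claims = []
    for i in range(n_claims):
        prompt = (
            f"Observation: {observation}\n"
            "State one non-trivial claim that can be inferred from this "
            "observation, with a short chain of reasoning, written as a "
            f"brief essay or exercise. Take a different angle than attempt {i}."
        )
        claims.append(generate(prompt))
    return claims

def build_synthetic_corpus(rare_observations: list[str]) -> list[str]:
    """Collect the expansions into a synthetic pre-training corpus."""
    corpus: list[str] = []
    for obs in rare_observations:
        corpus.extend(expand_observation(obs))
    return corpus
```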
I don’t know what superintelligent in-context learning is
In-context learning is the capability to make use of novel data that’s only seen in the context, not in pre-training, to do tasks that depend on this novel data, in ways that would normally have been expected to require seeing it in pre-training. In-context learning is a model capability; it’s learned. So its properties are not capped by those of the hardcoded model training algorithm; notably, in-context learning could in principle have higher sample efficiency (which might be crucial for generating a lot of synthetic data out of a few rare observations). Right now it’s worse in most respects, but that could change with scale without substantially modifying the transformer architecture, which is the premise of this thread.
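A toy illustration of that definition (my own made-up example): the novel data lives only in the prompt, never in any training set, and the task depends on it.

```
# The made-up facts below are seen only in the context. A model with good
# in-context learning uses them anyway, with no weight update. generate is
# again a placeholder for a chatbot call, so it's left commented out.

novel_facts = (
    "A zorblax is a tool that tightens hex bolts.\n"
    "A frimble is a tool that loosens hex bolts.\n"
)
question = "Q: Which tool do you reach for to remove a hex bolt?\nA:"

prompt = novel_facts + "\n" + question
# answer = generate(prompt)   # expected: something like "a frimble"
```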
By superintelligent in-context learning I mean capabilities of in-context learning that significantly exceed those of humans: things like fully comprehending a new paper without changing any model weights, and becoming able to immediately write the next one in the same context window. I agree that it’s not very plausible, and probably can’t happen without sufficiently deep circuits, which even deep networks don’t seem to normally develop. But it’s not really ruled out by anything that’s been tried so far. Recent work on essentially pre-training with some weights frozen, without losing resulting performance, suggests a trend of increasing feasible model size for a given amount of compute. So I’m not sure this can’t be done in a few years. Then there are things like memory transformers, handing a lot more data than fits in a context to a learned learning capability.
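For what pre-training with some frozen weights means mechanically, here’s a generic PyTorch sketch, not any particular paper’s recipe; the toy model and the choice of which layer to freeze are arbitrary assumptions. The frozen parameters keep their values and need no gradients or optimizer state, while only the rest get updated.

```
import torch
import torch.nn as nn

# Toy stand-in for a much larger model; freezing the first layer is arbitrary.
model = nn.Sequential(
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 512),
)

# Freeze the first linear layer: its parameters keep their current values
# and receive no gradients.
for p in model[0].parameters():
    p.requires_grad = False

# Only the still-trainable parameters go to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One dummy training step with a stand-in objective.
x = torch.randn(8, 512)
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```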