I’d like to provide some additional context:
LLM + interpreter is considered neurosymbolic rather than just ‘scale.’ A weak model couldn’t do it even with an interpreter, and that was François’ point: you need a good DL model to guide the program search.
For this reason, I think it’s unfair if people try to dunk on François with something like, “See, scale is all you need.”
François agrees with me; he liked a tweet I shared saying the above, and said: “Obviously combining a code interpreter (which is a symbolic system of enormous complexity) with a LLM is neurosymbolic. AlphaGo was neurosymbolic as well. These are universally accepted definitions.” You can disagree with him on what should be considered neurosymbolic, but I think it’s important for us to know what we all mean here even if we have been using the word differently.
He says more here:
If you are generating lots of programs, checking each one with a symbolic checker (e.g. running the actual code of the program and verifying the output), and selecting those that work, you are doing program synthesis (aka “discrete program search”).
The main issue with program synthesis is combinatorial explosion: the “space of all programs” grows combinatorially with the number of available operators and the size of the program.
The best solution to fight combinatorial explosion is to leverage *intuition* over the structure of program space, provided by a deep learning model. For instance, you can use an LLM to sample a program, or to make branching decisions when modifying an existing program.
Deep learning models are inexact and need to be complemented with discrete search and symbolic checking, but they provide a fast way to point to the “right” area of program space. They help you navigate program space, so that your discrete search process has less work to do and becomes tractable.
Here’s a talk I did at a workshop in Davos in March that goes into these ideas in a bit more detail.
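To make the loop described above concrete, here is a minimal sketch of LLM-guided program search with a symbolic checker. It is an illustration under assumptions, not the implementation from the post: the `propose` callable stands in for whatever LLM prompting scheme is used, and the `transform` entry-point name is made up.

```python
from __future__ import annotations
from typing import Callable

Grid = list[list[int]]
Pair = tuple[Grid, Grid]


def synthesize(train_pairs: list[Pair],
               propose: Callable[[list[Pair]], list[str]],
               rounds: int = 3) -> str | None:
    """LLM-guided program search: `propose` (the neural part) returns Python
    source strings; the interpreter (the symbolic part) runs each candidate
    against every demonstration pair and keeps the first one that fits."""
    for _ in range(rounds):
        for source in propose(train_pairs):
            env: dict = {}
            try:
                exec(source, env)                      # run the candidate in the interpreter
                fn = env["transform"]                  # assumed entry-point name
                if all(fn(x) == y for x, y in train_pairs):
                    return source                      # passes every demonstration
            except Exception:
                pass                                   # crashing or wrong programs are discarded
    return None
```

Brute-force enumeration of candidate sources is what explodes combinatorially; the LLM’s only job in this loop is to keep the candidate list short and concentrated in the right region of program space.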
GPT-4o was trained on JSONs of the public datasets (which is what Ryan tested on). Unclear how much this could impact performance on the public train and test sets. Would be great to see the performance on the private test set.
I do think the limited amount of compute you can use to win the competition should be taken into account. Perhaps enough compute solves the tasks and, in practice, that is mostly all that matters (even if a model + scaffolding can’t solve ARC-AGI under the present rules).
Given that this setup can’t be used for the actual competition, it may be that SoTA (neurosymbolic) models can get a high score a year or more before models that are actually allowed to enter the competition can match it.
Though, you’ll probably be able to get a model capable of doing this on a P100 if you first solve it with a larger model and then have the weaker model leverage that.
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn’t any interesting weight behind the claim “neurosymbolic methods are necessary”.
You might as well say “evolution didn’t create intelligent software engineers because humans are much worse at software engineering without access to a python interpreter, so only neurosymbolic intelligence will work”.
I think this is probably fine based on my understanding of how data leakage issues work, but it seems worth checking.
I think it seems reasonable to analyze benchmarks while fixing AI compute costs to be 10-100x human costs. But, this is a huge amount of compute because inference is cheap and humans are expensive.
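As a rough worked example of why a 10-100x budget buys a lot of inference (every number below is a hypothetical placeholder, not a figure from the post or from the competition rules):

```python
# Hypothetical illustration of the "10-100x human cost" framing.
# Neither number below comes from the post or from the ARC Prize rules.
human_cost_per_task = 20.00      # assumed: $ paid for one human attempt at one task
cost_per_llm_sample = 0.02       # assumed: $ of inference per sampled candidate program

for multiplier in (10, 100):
    budget = multiplier * human_cost_per_task
    samples = budget / cost_per_llm_sample
    print(f"{multiplier}x human cost -> ${budget:,.0f} per task -> ~{samples:,.0f} sampled programs")

# 10x human cost -> $200 per task -> ~10,000 sampled programs
# 100x human cost -> $2,000 per task -> ~100,000 sampled programs
```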
Yeah, seems likely.
If somebody achieved a high score on the ARC challenge by providing the problems to an LLM as prompts and having it return the solutions as output, then the claim “neurosymbolic methods are necessary” would be falsified. So there is weight to the claim. Whether it is interesting or not is obviously in the eye of the beholder.
I get that adding an interpreter feels like a low bar for neurosymbolic (and that scale helps considerably with making interpreters/tools useful in the first place), but I’d be curious to know what you have in mind when you hear that word.
To be honest, I didn’t really have any clear picture of what people meant by neurosymbolic.
I assumed something like massively scaled-up expert systems or SVMs with hand-engineered features, but I haven’t seen anyone exhibit a clear case of something that is claimed to be a central example of a neurosymbolic system and that also does any very interesting task.
But is AlphaGo using just the policy network (which still beats pros) ‘neurosymbolic’? What about MuZero, is that neurosymbolic?
The current approach is heavy on the symbolic program part, but I think it would be entirely possible to distill the programs down into the LLM, and simply have the LLM generate/model sequences of image-tokens++program-tokens++image-tokens, and learn from any ‘neurosymbolic’ approach: https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt/comment/59334256 (And then you would have a pure LLM solver which at no point actually runs a Python program, it merely uses them as a scaffold for inner-monologue; and given results on inner-monologue distillation, likely even the Python tokens could be trained away.)
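A minimal sketch of what that distillation data could look like, assuming you already have tasks paired with verified solution programs; the tags and serialization scheme are invented for illustration and are not taken from the linked comment:

```python
def serialize_grid(grid: list[list[int]]) -> str:
    """Flatten an ARC grid into plain text, one row per line."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def make_distillation_example(inp: list[list[int]],
                              out: list[list[int]],
                              program_source: str) -> str:
    """Build one training sequence: input grid ++ program ++ output grid.

    A model trained on many such sequences imitates the program-writing step
    inside its forward pass; a later pass could drop the <program> span to
    distill the behaviour into direct grid-to-grid prediction."""
    return "\n".join([
        "<input>",   serialize_grid(inp),  "</input>",
        "<program>", program_source,       "</program>",
        "<output>",  serialize_grid(out),  "</output>",
    ])


# Toy task (swap colours 1 and 2) with a verified solution program:
program = ("def transform(g):\n"
           "    return [[{1: 2, 2: 1}.get(c, c) for c in row] for row in g]")
print(make_distillation_example([[1, 2], [0, 1]], [[2, 1], [0, 2]], program))
```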