LLM + interpreter is considered neurosymbolic rather than just ‘scale.’
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn’t any interesting weight behind the claim “neurosymbolic methods are necessary”.
You might as well say “evolution didn’t create intelligent software engineers because humans are much worse at software engineering without access to a python interpreter, so only neurosymbolic intelligence will work”.
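To make concrete what “LLM + interpreter” amounts to here, a minimal sketch (the toy task, the candidate programs, and all names are illustrative, not Ryan’s actual pipeline): the model proposes candidate grid-transformation programs as text, and the interpreter’s only job is to run them against the training pairs and keep whichever fit.

```python
# Minimal, illustrative sketch of an "LLM + Python interpreter" loop on an
# ARC-style task. The candidate programs below are hardcoded stand-ins for
# text an LLM would generate; the exec/check scaffolding is the point.

# One toy task: each training pair maps an input grid to an output grid
# (here, the hidden rule is "rotate the grid 180 degrees").
train_pairs = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),
    ([[0, 5], [6, 0]], [[0, 6], [5, 0]]),
]

# Stand-ins for LLM samples: each is Python source defining `transform(grid)`.
candidate_sources = [
    "def transform(grid):\n    return grid",                               # identity
    "def transform(grid):\n    return grid[::-1]",                         # flip vertically
    "def transform(grid):\n    return [row[::-1] for row in grid[::-1]]",  # rotate 180
]

def passes_all(source: str) -> bool:
    """Run one candidate program and test it against every training pair."""
    namespace: dict = {}
    try:
        exec(source, namespace)          # the "symbolic" part: just a Python interpreter
        fn = namespace["transform"]
        return all(fn(inp) == out for inp, out in train_pairs)
    except Exception:
        return False                     # crashing candidates are simply discarded

survivors = [src for src in candidate_sources if passes_all(src)]
print(f"{len(survivors)} of {len(candidate_sources)} candidates fit the training pairs")
```

All of the task knowledge comes from the model’s guesses; the “symbolic” component just runs and filters them, which is why counting this as neurosymbolic feels like a low bar.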
GPT-4o was trained on JSONs of the public datasets (which is what Ryan tested on). Unclear how much this could impact performance on the public train and test sets. Would be great to see the performance on the private test set.
I think this is probably fine based on my understanding of how data leakage issues work, but it seems worth checking.
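If someone did want to check, one cheap probe is to see whether GPT-4o will continue a public task’s JSON file verbatim from a truncated prefix, which would point to memorization of the files themselves. This is only a sketch: the file layout is an assumption, the OpenAI chat client is just one way to query the model, and a negative result would not rule contamination out.

```python
# Sketch of a memorization probe for the public ARC data. Assumptions: the
# public tasks live as JSON files under data/training/, and we query GPT-4o
# via the standard OpenAI client. Verbatim continuation of a truncated task
# file would suggest the JSON itself was memorized during pretraining.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY in the environment

def memorization_probe(task_path: Path, prefix_chars: int = 400) -> bool:
    raw = task_path.read_text()
    prefix = raw[:prefix_chars]
    continuation = raw[prefix_chars:prefix_chars + 200]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Continue the file exactly, with no commentary."},
            {"role": "user", "content": prefix},
        ],
        temperature=0,
        max_tokens=300,
    )
    guess = response.choices[0].message.content or ""
    # Exact continuation is strong evidence of memorization; near misses are ambiguous.
    return guess.strip().startswith(continuation.strip()[:100])

hits = sum(memorization_probe(p) for p in sorted(Path("data/training").glob("*.json"))[:20])
print(f"verbatim continuations on {hits}/20 sampled public tasks")
```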
I do think the limit on the amount of compute you can use to win the contest should be taken into account. Perhaps enough compute solves the task, and in practice that is mostly all that matters (even if a model + scaffolding can’t solve ARC-AGI under the present rules).
I think it seems reasonable to analyze benchmarks while fixing AI compute costs to be 10-100x human costs. But, this is a huge amount of compute because inference is cheap and humans are expensive.
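As rough arithmetic (every number below is an illustrative assumption, not a measured price): if a human solver costs on the order of $20 of time per ARC task and a sampled candidate program costs on the order of a couple of cents, then a 10–100x-human budget buys on the order of 10,000–100,000 samples per task.

```python
# Back-of-the-envelope: how much sampling a "10-100x human cost" budget buys.
# All inputs are illustrative assumptions, not measurements.
human_cost_per_task = 20.0      # assumed: ~$20 of human time per ARC task
cost_per_llm_sample = 0.02      # assumed: ~$0.02 per sampled candidate program

for multiplier in (10, 100):
    budget = multiplier * human_cost_per_task
    samples = int(budget / cost_per_llm_sample)
    print(f"{multiplier:>3}x human cost -> ${budget:>6.0f}/task -> ~{samples:,} samples per task")
```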
Given that this setup can’t be used for the actual competition, it may be that SoTA (neurosymbolic) models can get a high score a year or more before any model that is actually allowed to enter the competition can.
Yeah, seems likely.
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn’t any interesting weight behind the claim “neurosymbolic methods are necessary”.
If somebody achieved a high score on the ARC challenge by providing the problems to an LLM as prompts and having it return the solutions as output, then the claim “neurosymbolic methods are necessary” would be falsified. So there is weight to the claim. Whether it is interesting or not is obviously in the eye of the beholder.
I get that adding an interpreter feels like a low bar for neurosymbolic (and that scale helps considerably with making interpreters/tools useful in the first place), but I’d be curious to know what you have in mind when you hear that word.
To be honest, I didn’t really have any clear picture of what people meant by neurosymbolic.
I assumed something like massively scaled-up expert systems or SVMs with hand-engineered features, but I haven’t seen anyone exhibit a clear case of a system that is claimed to be a central example of neurosymbolic AI and that also does a very interesting task.