LLM + interpreter is considered neurosymbolic rather than just ‘scale.’
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn’t any interesting weight behind the claim “neurosymbolic methods are necessary”.
You might as well say “evolution didn’t create intelligent software engineers because humans are much worse at software engineering without access to a python interpreter, so only neurosymbolic intelligence will work”.
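To make concrete what “LLM + interpreter” amounts to here, a minimal sketch (the toy task, the candidate programs, and all names are illustrative, not Ryan’s actual pipeline): the model proposes candidate grid-transformation programs as text, and the interpreter’s only job is to run them against the training pairs and keep whichever fit.

```python
# Minimal, illustrative sketch of an "LLM + Python interpreter" loop on an
# ARC-style task. The candidate programs below are hardcoded stand-ins for
# text an LLM would generate; the exec/check scaffolding is the point.

# One toy task: each training pair maps an input grid to an output grid
# (here, the hidden rule is "rotate the grid 180 degrees").
train_pairs = [
    ([[1, 2], [3, 4]], [[4, 3], [2, 1]]),
    ([[0, 5], [6, 0]], [[0, 6], [5, 0]]),
]

# Stand-ins for LLM samples: each is Python source defining `transform(grid)`.
candidate_sources = [
    "def transform(grid):\n    return grid",                               # identity
    "def transform(grid):\n    return grid[::-1]",                         # flip vertically
    "def transform(grid):\n    return [row[::-1] for row in grid[::-1]]",  # rotate 180
]

def passes_all(source: str) -> bool:
    """Run one candidate program and test it against every training pair."""
    namespace: dict = {}
    try:
        exec(source, namespace)          # the "symbolic" part: just a Python interpreter
        fn = namespace["transform"]
        return all(fn(inp) == out for inp, out in train_pairs)
    except Exception:
        return False                     # crashing candidates are simply discarded

survivors = [src for src in candidate_sources if passes_all(src)]
print(f"{len(survivors)} of {len(candidate_sources)} candidates fit the training pairs")
```

All of the task knowledge comes from the model’s guesses; the “symbolic” component just runs and filters them, which is why counting this as neurosymbolic feels like a low bar.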
GPT-4o was trained on JSONs of the public datasets (which is what Ryan tested on). Unclear how much this could impact performance on the public train and test sets. Would be great to see the performance on the private test set.
I think this is probably fine based on my understanding of how data leakage issues work, but it seems worth checking.
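If someone did want to check, one cheap probe is to see whether GPT-4o will continue a public task’s JSON file verbatim from a truncated prefix, which would point to memorization of the files themselves. This is only a sketch: the file layout is an assumption, the OpenAI chat client is just one way to query the model, and a negative result would not rule contamination out.

```python
# Sketch of a memorization probe for the public ARC data. Assumptions: the
# public tasks live as JSON files under data/training/, and we query GPT-4o
# via the standard OpenAI client. Verbatim continuation of a truncated task
# file would suggest the JSON itself was memorized during pretraining.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # needs OPENAI_API_KEY in the environment

def memorization_probe(task_path: Path, prefix_chars: int = 400) -> bool:
    raw = task_path.read_text()
    prefix = raw[:prefix_chars]
    continuation = raw[prefix_chars:prefix_chars + 200]
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Continue the file exactly, with no commentary."},
            {"role": "user", "content": prefix},
        ],
        temperature=0,
        max_tokens=300,
    )
    guess = response.choices[0].message.content or ""
    # Exact continuation is strong evidence of memorization; near misses are ambiguous.
    return guess.strip().startswith(continuation.strip()[:100])

hits = sum(memorization_probe(p) for p in sorted(Path("data/training").glob("*.json"))[:20])
print(f"verbatim continuations on {hits}/20 sampled public tasks")
```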
I do think the limit on the amount of compute you can use to win the contest should be taken into account. Perhaps enough compute solves the task, and in practice that is mostly all that matters (even if a model + scaffolding can’t solve ARC-AGI under the present rules).
I think it seems reasonable to analyze benchmarks while fixing AI compute costs to be 10-100x human costs. But, this is a huge amount of compute because inference is cheap and humans are expensive.
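As rough arithmetic (every number below is an illustrative assumption, not a measured price): if a human solver costs on the order of $20 of time per ARC task and a sampled candidate program costs on the order of a couple of cents, then a 10–100x-human budget buys on the order of 10,000–100,000 samples per task.

```python
# Back-of-the-envelope: how much sampling a "10-100x human cost" budget buys.
# All inputs are illustrative assumptions, not measurements.
human_cost_per_task = 20.0      # assumed: ~$20 of human time per ARC task
cost_per_llm_sample = 0.02      # assumed: ~$0.02 per sampled candidate program

for multiplier in (10, 100):
    budget = multiplier * human_cost_per_task
    samples = int(budget / cost_per_llm_sample)
    print(f"{multiplier:>3}x human cost -> ${budget:>6.0f}/task -> ~{samples:,} samples per task")
```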
Given that this setup can’t be used for the actual competition, it may be that SoTA (neurosymbolic) models can get a high score a year or more before any model that is actually allowed to enter the competition can.
Yeah, seems likely.
Fair enough if literally any approach using symbolic programs (e.g. a python interpreter) is considered neurosymbolic, but then there isn’t any interesting weight behind the claim “neurosymbolic methods are necessary”.
If somebody achieved a high score on the ARC challenge by providing the problems to an LLM as prompts and having it return the solutions as output, then the claim “neurosymbolic methods are necessary” would be falsified. So there is weight to the claim. Whether it is interesting or not is obviously in the eye of the beholder.
I get that adding an interpreter feels like a low bar for neurosymbolic (and that scale helps considerably with making interpreters/tools useful in the first place), but I’d be curious to know what you have in mind when you hear that word.
To be honest, I didn’t really have any clear picture of what people meant by neurosymbolic.
I assumed something like massively scaled-up expert systems or SVMs with hand-engineered features, but I haven’t seen anyone exhibit a clear case of a system that is claimed to be a central example of neurosymbolic AI and that also does a very interesting task.