We do not in fact have strong evidence for this. There is no established human baseline for the ARC puzzles, smart humans or otherwise, just a claim that two people the designers asked to attempt them were able to solve them all. It seems entirely plausible to me that the best score on that leaderboard is pretty close to the human median.
Edit: I failed to mention that there is a baseline on the test set, which is different from the eval set used for the scoreboard and is, I believe, significantly easier.
Their website cites https://cims.nyu.edu/~brenden/papers/JohnsonEtAl2021CogSci.pdf as having found an average 84% success rate on the tested subset of puzzles.
It is worth noting that LLM-based approaches can perform reasonably well on the train set. For instance, my approach gets 72%.
The LLM-based approach works quite differently from how a human would normally solve the problem, and if you give LLMs “only one attempt”, or otherwise limit them to a qualitatively similar amount of reasoning as humans, I think they do considerably worse than humans. (Though to make this “only one attempt” baseline fair, you have to allow for the iteration that humans would normally do.)
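To make the “only one attempt” comparison concrete: one (hedged) way to score it is the standard pass@k estimator from Chen et al. 2021, applied to however many candidate solutions the LLM samples per puzzle. This is just an illustrative sketch with made-up sample counts, not the exact scoring my approach uses:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k attempts, drawn without replacement from n sampled
    attempts of which c are correct, solves the puzzle."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 64 sampled candidate programs per puzzle, 5 correct.
print(pass_at_k(n=64, c=5, k=1))   # "only one attempt" score, roughly 0.08
print(pass_at_k(n=64, c=5, k=64))  # score when every sample can be checked: 1.0
```

The gap between those two numbers is the thing I’m pointing at: restricting the model to one attempt, without analogously restricting the human, changes the comparison a lot.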
Yeah, I failed to mention this. Edited to clarify what I meant.
Thanks for finding a cite. I’ve definitely seen Chollet (on Twitter) give 85% as the success rate on the (easier) training set (and the paper picks problems from the training set as well).
There is important context here.
I also think this is plausible. Note that randomly selected examples from the public evaluation set are often considerably harder than the train set, for which there is a known MTurk baseline (an average of 84%).