Many people have argued that recent language models don’t have “real” intelligence and are just doing shallow pattern matching. For example, see this recent post.
I don’t really agree with this. I think real intelligence is just a word for deep pattern matching, and our models have been getting progressively deeper at their pattern matching over the years. The machines are not stuck at some very narrow level. They’re just at a moderate depth.
I propose a challenge:
The challenge is to come up with the best prompt demonstrating that, even after 2-5 years of continued advancement, language models will still struggle with basic reasoning tasks that ordinary humans can do easily.
Here’s how it works.
Name a date (e.g. January 1st 2025), and a prompt (e.g. “What food would you use to prop a book open and why?”). Then, on that date, we should commission a Mechanical Turk task to ask humans to answer the prompt, and ask the best current publicly available language model to answer the same prompt.
Then, we will ask LessWrongers to guess which replies were real human replies, and which ones were machine generated. If LessWrongers can’t do better than random guessing, then the machine wins.
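The "can't do better than random guessing" criterion can be made precise with a simple one-sided binomial test. Here is a minimal sketch (the function name, the example counts, and the 0.05 threshold are my own assumptions, not part of the proposal):

```python
from math import comb

def binomial_p_value(correct: int, total: int, p: float = 0.5) -> float:
    """One-sided p-value: the probability of getting at least `correct`
    right guesses out of `total` if guessers were choosing at random."""
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(correct, total + 1))

# Hypothetical example: LessWrongers guess 60 of 100 reply-pairs correctly.
pval = binomial_p_value(60, 100)
# The machine "wins" if the guessers are statistically indistinguishable
# from chance (assumed significance threshold of 0.05).
machine_wins = pval > 0.05
```

With 60/100 correct the p-value is below 0.05, so the guessers are doing measurably better than chance and the machine loses; around 52/100 or fewer, the result is consistent with random guessing.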