Find a sequence of words that is:
- 20 words long
- contains exactly 2 repetitions of the same word twice in a row
- contains exactly 2 repetitions of the same word thrice in a row
Here is its attempt. I add usual boilerplate about being fine to think before answering. First it gives a valid sequence using letters instead of words. I ask for words instead of letters and then it gives a sequence that is only 18 words long. I ask for 20 words and then it finally gets it.
Here’s a second try where I use a disambiguated version of your prompt (without boilerplate) and don’t provide hints beyond “I’m not satisfied, try harder”. The model ends up producing a sequence with placeholders like “unique8” instead of words, and although I keep saying I’m unsatisfied, it makes up nonsensical explanations for why and can’t figure out the real problem. It gets it immediately when I point out that I’m unhappy because “unique8” isn’t a word.
(This is without any custom instructions; it also seems able to do the task without code and its decision of whether to use code is very sensitive to even apparently unrelated instructions.)
I think it’s very likely that GPT-4 with more fine-tuning for general competence will be able to solve this task, and that with fine-tuning or a system prompt for persistence it would not need the “I’m not satisfied, try harder” reminder and would instead keep thinking until its answer is stable on reflection.
I didn’t see a more complicated version in the thread, but I think it’s quite likely that whatever you wrote will also be solved in 2024. I’d wildly guess a 50% chance that by the end of 2024 you will be unable (with an hour of tinkering, say) to design a task like this that’s easy for humans (in the sense that say at least 5% of college graduates can do it within 5 minutes) but hard for the best public agent built with the best public model.
I don’t know if I understand the full rules, so I don’t know if this satisfies them, but here:
https://chat.openai.com/share/0089e226-fe86-4442-ba07-96c19ac90bd2
Nice prompt! It solved the 3 x 3 problem too.
Wow, I’m impressed it caught itself; I was just trying to play with that 3 x 3 problem too. Thanks!
I tested it on 3 held-out problems and it got 1⁄3. That is significant progress and increases the chance that these can be solved with prompting. So it is partly a question of whether any major LLMs incorporate better auto-prompting.
Here’s the harder problem. I’ve also held out a third problem without posting it online.
I’m glad you have held-out problems, and I think it would be great if you had a handful (like 3) rather than just one. (If you have 5-10 it would also be cool to plot the success rate going up over time as ChatGPT improves.)
Here is the result of running your prompt with a generic system prompt (asking for an initial answer + refinement). It fails to meet the corner condition (and perplexingly says “The four corners (top left ‘A’, top right ‘A’, bottom left ‘A’, bottom right ‘B’) are distinct.”). When I point out that the four corners aren’t distinct it fixes this problem and gets it correct.
I’m happy to call this a failure until the model doesn’t need someone to point out problems. But I think that’s entirely fine-tuning and prompting and will probably be fixed on GPT-4.
That said, I agree that if you keep making these problems more complicated you will be able to find something that’s still pretty easy for a human (<5 minutes for the top 5% of college grads) and stumps the model. E.g. I tried: fill in a 4 x 4 grid such that one column and row have the same letter 4 times, a second column has the same letter 3 times, and all other rows and columns have distinct letters (here’s the model’s attempt). I’m predicting that this will no longer work by EOY 2024.
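The 4 x 4 grid condition above is easy to verify mechanically, which is useful when scoring model attempts. Here is a minimal sketch of such a checker, assuming one reading of the task: exactly one row and one column are constant, exactly one further column has some letter exactly 3 times, and every other row and column has 4 distinct letters. The names `max_count` and `check_grid` are made up for illustration.

```python
from collections import Counter


def max_count(line):
    """Highest number of times any single letter occurs in a row or column."""
    return max(Counter(line).values())


def check_grid(grid):
    """Check a 4 x 4 grid, given as 4 strings of 4 letters, against the assumed reading.

    A max_count of 4 means a constant line, 3 means one letter appears
    three times, and 1 means all four letters are distinct.
    """
    if len(grid) != 4 or any(len(row) != 4 for row in grid):
        return False
    rows = [list(row) for row in grid]
    cols = [list(col) for col in zip(*rows)]
    row_profile = sorted(max_count(r) for r in rows)
    col_profile = sorted(max_count(c) for c in cols)
    # Rows: one constant, three distinct. Columns: one constant, one with a
    # letter three times, two distinct.
    return row_profile == [1, 1, 1, 4] and col_profile == [1, 1, 3, 4]


good = ["AAAA", "ABCD", "ABDC", "ABEF"]
bad = ["AAAA", "ABCD", "ABDC", "ABCD"]  # third column repeats C twice
print(check_grid(good), check_grid(bad))
```

Under this reading the constant row and constant column must share their letter (they intersect), which forces the "3 times" column to hold the shared letter once and a second letter three times, as in `good`.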
I’ve added 6 more held-out problems for a total of 7. Agree that getting the answer without pointing out problems is the right standard.
I can’t tell if you think these problems will remain hard for the model, and if so why.
I’d put 70% on an LM agent being able to do the 4x4 grid example by EOY 2024 because it seems pretty easy. I’d update if that was wrong. (And I’d be fine replacing that by held-out examples of similar complexity.)
Will you be updating your picture if it can do these tasks by EOY? How much have you updated in the last few years? I feel like 2018 Paul was pretty surprised by how good ChatGPT is now (its Turing test ability is maybe ~85th percentile of my forecasts), and that in 2018 you were at least qualitatively trying to argue in the opposite direction.
I think they will remain hard by EOY 2024, as in, of this problem and the 7 held-out ones of similar difficulty, the best LLM will probably not solve 4⁄8.
I think I would update some on how fast LLMs are advancing, but these are not inherently very hard problems, so I don’t think it would be a huge surprise; this was meant to be one of the easiest things they fail at right now. Maybe if that happens I would think things are going 1.6x as fast in the short term as I would otherwise have thought?
I was surprised by GPT-3/3.5 but not so much by GPT-4. On net, I think it adds up to an update that LLMs are advancing faster than I thought, but I haven’t much changed my long-term AGI timelines, because I think AGI will involve lots of technologies, not just LLMs, although LLM progress is some update about general tech progress.
Do you have any hard things that you are confident LLMs won’t do soon? (Short of: “autonomously carry out R&D.”) Any tasks you think an LM agent won’t be able to achieve?
Beat Ocarina of Time with <100 hours of playing Zelda games during training or deployment (but perhaps training on other games), no reading guides/walkthroughs/playthroughs, no severe bug exploits (those that would cut down the required time by a lot), no reward-shaping/advice specific to this game generated by humans who know non-trivial things about the game (but the agent can shape its own reward). Including LLM coding a program to do it. I’d say probably not by 2033.
It seems fairly unlikely that this specific task will be completed soon for a variety of reasons: it sounds like it technically requires training a new LM that removes all data about Zelda games; it involves a fair amount of videogame-specific engineering hassle; and it’s far from anything with obvious economic relevance + games are out of fashion (not because they are too hard). I do still think it will be done before 2033.
If we could find a similar task that was less out of the way then I’d probably be willing to bet on it happening much sooner. Presumably this is an analogy to something that would be relevant for AI systems automating R&D and is therefore closer to what people are interested in doing with LMs.
Although we can’t bet on it, I do think that if AI developers made a serious engineering effort on the Zelda task right now then they would have a reasonable chance of success within 2 years (I’d wildly guess 25%), and this will rise over time. I think GPT-4 with vision will do a reasonable job of identifying the next step needed to complete the game, and models trained with RL to follow instructions in video games across a broad variety of games (including 3D games with similar controls and perspective to Zelda) would likely be competent enough to solve most of the subtasks if you really went all out on it.
I don’t have a good sense of what part you think is hard. I’d guess that the most technically uncertain part is training an RL policy that takes a description of a local task (e.g. “throw a bomb so that it explodes next to the monster’s eye”) and then actually executes it. But my sense is that you might be more concerned about high-level planning.
I think it’s hard because it requires some planning and puzzle solving in a new, somewhat complex environment. The AI results on Montezuma’s Revenge seem pretty unimpressive to me because they’re going to a new room, trying random stuff until they make progress, then “remembering” that for future runs. Which means they need quite a lot of training data.
For short-term RL given lots of feedback, there are already decent results, e.g. in StarCraft and Dota. So the difficulty is more figuring out how to automatically scope out narrow RL problems that can be learned without too much training time.
I tried it. In this run it wrote a Python program to solve it correctly, or at least with a valid interpretation of the rules.
https://chat.openai.com/share/ee129414-58d5-41af-9a18-fde2b921b45b
In other runs it guessed a sequence with tokens.
It’s almost like the model needs some kind of introspection, where it can learn when a given tool is more or less likely to produce a correct result, and then produce a solution with that strategy every run.
Running the prompt several times over resulted in it guessing the answer, writing a different Python program, using placeholder words, and so on. As users, we want the maximum probability of the correct answer.
I don’t see how that’s a valid interpretation of the rules. Isn’t it checking to find that there is at least one 2x repetition and at least one 3x repetition? Whereas the request was exactly two of each.
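To make the difference between the two readings concrete, here is a small sketch of both checks. It is not the program from the linked chat; it assumes that "twice in a row" means a maximal run of exactly 2 identical words (and "thrice" a maximal run of exactly 3), and that longer runs are simply not counted. The function names are hypothetical.

```python
from itertools import groupby


def run_lengths(words):
    """Lengths of the maximal runs of identical consecutive words."""
    return [len(list(group)) for _, group in groupby(words)]


def satisfies_exactly(words):
    """Strict reading: 20 words, exactly two doubles and exactly two triples."""
    runs = run_lengths(words)
    return len(words) == 20 and runs.count(2) == 2 and runs.count(3) == 2


def satisfies_at_least(words):
    """Looser reading: 20 words, at least one double and at least one triple."""
    runs = run_lengths(words)
    return len(words) == 20 and 2 in runs and 3 in runs


strict = ("apple apple banana cherry cherry date egg egg egg fig "
          "grape grape grape hat ink jam kite lime mint nut").split()
loose = ("apple apple banana banana banana cherry date egg fig grape "
         "hat ink jam kite lime mint nut oak pear quince").split()

print(satisfies_exactly(strict), satisfies_at_least(strict))
print(satisfies_exactly(loose), satisfies_at_least(loose))
```

A sequence like `loose` passes the "at least one" check while failing the "exactly two of each" check, which is the gap being pointed out here.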