What about estimating an LLM's capability from the length of the longest sequence of numbers that it can reverse?
I used prompts like:
“please reverse 4 5 8 1 1 8 1 4 4 9 3 9 3 3 3 5 5 2 7 8”
“please reverse 1 9 4 8 6 1 3 2 2 5”
etc...
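For anyone who wants to try this themselves, here is a rough sketch of how such prompts could be generated and a reply checked. The helper names `make_prompt` and `is_correct` are made up for illustration. The choice of digits 1 to 9 is an assumption based on the two examples above. The checker also assumes the reply contains no digits other than the reversed sequence itself.

```python
import random

def make_prompt(n, rng):
    """Return (digits, prompt) for a random sequence of n single digits."""
    # Assumption: digits 1-9, as in the example prompts.
    digits = [rng.randint(1, 9) for _ in range(n)]
    prompt = "please reverse " + " ".join(str(d) for d in digits)
    return digits, prompt

def is_correct(digits, reply):
    """True if the digit characters in reply are exactly the input reversed."""
    # Assumption: the reply contains no other digits (no "here are your 20 numbers").
    answer = [int(ch) for ch in reply if ch.isdigit()]
    return answer == digits[::-1]

rng = random.Random(0)  # fixed seed so the prompts are reproducible
digits, prompt = make_prompt(20, rng)
print(prompt)
```

The reply would come from whatever chat interface or API you use; that part is deliberately left out here.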
Some results:
- Llama2 starts making mistakes after 5 numbers
- Llama3 can do 10, but fails at 20
- GPT-4 can do 20 but fails at 40
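One way to turn results like these into a single number is to increase the length until the first mistake. This is only a sketch under assumptions: `ask_model` is a hypothetical callable that takes a prompt string and returns the reply string, `reverse_span` and `toy_model` are illustrative names, and the reply is assumed to be just the space-separated digits. The numbers above were not produced with this code.

```python
import random
from typing import Callable

def reverse_span(ask_model: Callable[[str], str],
                 max_len: int = 40, trials: int = 5, seed: int = 0) -> int:
    """Largest n such that every trial at every length up to n was reversed correctly."""
    rng = random.Random(seed)
    span = 0
    for n in range(1, max_len + 1):
        for _ in range(trials):
            digits = [str(rng.randint(1, 9)) for _ in range(n)]
            reply = ask_model("please reverse " + " ".join(digits))
            # Assumption: the reply is only the space-separated digits.
            if reply.split() != digits[::-1]:
                return span  # stop at the first mistake
        span = n
    return span

def toy_model(prompt: str) -> str:
    """Stand-in for a real model: correct up to 7 digits, drops one digit beyond that."""
    digits = prompt.split()[2:]  # skip the words "please reverse"
    if len(digits) > 7:
        digits = digits[:-1]
    return " ".join(reversed(digits))

print(reverse_span(toy_model))
```

Stopping at the first mistake is the strictest reading of "starts making mistakes after N numbers"; a softer variant could require, say, a majority of trials to pass at each length.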
The followup questions are:
- what should be the name of this metric?
- are the other top-scoring models like Claude similar? (I don’t have access)
- any bets on how many numbers GPT-5 will be able to reverse?
- how many numbers should AGI be able to reverse? ASI? can this be a Turing test of sorts?
In psychometrics this is called “backward digit span”.