I recently asked both claude and gpt4 to estimate their benchmark scores on various benchmarks. if I were trying harder to get a good test I’d probably do it about 10 times and see what the variation is
I asked claude opus whether it could clearly parse different tic-tac-toe notations and it just said ‘yes I can’ to all of them, despite having pretty poor performance in most.
I recently asked both claude and gpt4 to estimate their benchmark scores on various benchmarks. if I were trying harder to get a good test I’d probably do it about 10 times and see what the variation is
I asked claude opus whether it could clearly parse different tic-tac-toe notations and it just said ‘yes I can’ to all of them, despite having pretty poor performance in most.
yeah, its introspection is definitely less than perfect. I’ll DM the prompt I’ve been using so you can see its scores.