Strong upvote for tackling an important problem. I’ve tried to write something along these lines, and made no progress.
But I still see lots of room for improvement.
I’d like to see some more sophisticated versions of the Turing test: use judges with a decent track record at judging Turing tests, and run the tests for longer than 2 hours.
I don’t think the Nano AGI test should rely on statistical significance—that says more about the sample size than about the effect size.
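For concreteness, here’s a minimal sketch (my own illustration, not anything from the post) of why a significance threshold mostly rewards sample size: the same negligible difference between simulated “human” and “model” scores flips from unremarkable to highly significant once n gets large, while the effect size stays tiny throughout.

```python
# Minimal illustration: statistical significance tracks sample size, not effect size.
# The "human" and "model" scores here are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.01  # difference in means, in units of one standard deviation

for n in (100, 10_000, 1_000_000):
    human = rng.normal(0.0, 1.0, n)          # simulated human judges' scores
    model = rng.normal(true_effect, 1.0, n)  # simulated model scores, barely different
    t, p = stats.ttest_ind(model, human)
    # The p-value shrinks toward zero as n grows, even though the underlying
    # effect (Cohen's d ~ 0.01) never changes.
    print(f"n={n:>9,}  p={p:.3g}  effect size ~ {true_effect}")
```

So a test defined by “statistically significant difference” can be passed (or failed) mostly by choosing how many trials to run; a threshold on effect size would be more informative.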
Improved versions of the Turing test seem like a natural place to start. We’ve probably learned more about what language models are capable of in the last two years (since the release of GPT-3) than in all previous years. The Feigenbaum test looks much better to me than the Loebner Silver Prize, for example.