The succession of new OpenAI products has proven to me that I’m bad at articulating benchmarks for AI success.
For example, ChatGPT can generate working Python code for a game of mancala, except that it ignores captures and second turns completely, and the UI is terrible. But I’m pretty good at Python, and it would be easier for me to debug and improve ChatGPT’s code than to write a complete mancala game.
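To give a sense of scale, the missing rules really are a small fix. The sketch below is mine, not ChatGPT's actual output; the board layout (indices 0–5 and 7–12 as pits, 6 and 13 as stores) and the `sow` function are assumptions about how such code might be structured:

```python
# A minimal sketch of the two rules the generated code omitted:
# captures and extra turns. Assumed layout: indices 0-5 are player 0's
# pits, 6 is player 0's store, 7-12 are player 1's pits, 13 is their store.

def sow(board, player, pit):
    """Sow stones from `pit`; return True if the player earns another turn."""
    stones, board[pit] = board[pit], 0
    own_store = 6 if player == 0 else 13
    skip = 13 if player == 0 else 6   # never drop a stone in the opponent's store
    pos = pit
    while stones:
        pos = (pos + 1) % 14
        if pos == skip:
            continue
        board[pos] += 1
        stones -= 1
    if pos == own_store:              # extra turn: last stone landed in own store
        return True
    own_pits = range(0, 6) if player == 0 else range(7, 13)
    opposite = 12 - pos               # the pit facing `pos` across the board
    if pos in own_pits and board[pos] == 1 and board[opposite] > 0:
        # capture: last stone landed in a previously empty pit on our side,
        # so take it plus everything in the opposite pit
        board[own_store] += board[pos] + board[opposite]
        board[pos] = board[opposite] = 0
    return False

board = [4] * 6 + [0] + [4] * 6 + [0]  # standard opening position
extra = sow(board, player=0, pit=2)    # last stone lands in the store: extra turn
```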
But I wouldn’t have thought to set out “writing code that can be fixed faster than a working program can be written from scratch” as a benchmark. In hindsight, it’s clearly a reasonable one, and it illustrates the smoothly scaling capabilities of these systems. I should use ChatGPT to come up with benchmarks for OpenAI’s next text-generating AI.
The idea of having ChatGPT invent benchmarks can’t be tested just by asking it to, but I tried asking it to come up with an intellectual challenge slightly more difficult than writing easily debugged code. Its only two ideas seemed to be:
Designing and implementing a new programming language that is easier to read and understand than existing languages, and has built-in features for debugging and error-checking.
Writing efficient and optimized algorithms for complex problems.
I don’t think either of these seems merely “slightly more difficult” than writing easily debuggable code.