Specifically, I’ve been trying to get GPT-3 to outperform the Hypothesis Ghostwriter in automatic generation of tests and specifications, without any success. I expect that GPT-4 will also underperform, but that it could outperform if fine-tuned on the problem.
If I knew where to get training data I’d like to try this with GPT-3, for that matter; I’m much more attached to the user experience of “hypothesis write mypackage generates good tests” than to any particular implementation (modulo installation and other manageable ops issues for novice users).
I think testing in general seems AGI-complete, in the sense that you need to understand the edge cases of a function (or what correct output looks like for “normal” input).
If we take the simplest testing case, say Python using pytest with typed code and some simple test input for each type (e.g. 0 and 1 for integers, empty/random strings, etc.), then you could show it examples of how to generate tests from function names… but then you could also just do that with a regex, or I guess with Hypothesis.
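For concreteness, a minimal sketch of the kind of template-generated test that describes, with stdlib functions standing in for whatever a signature scanner would pick up (the test names and inputs here are purely illustrative):

import pytest

# "Simple value per type" inputs: 0 and 1 for integers, an empty and a
# non-empty string. No model is needed to produce these.
@pytest.mark.parametrize("n", [0, 1])
def test_hex_returns_a_string(n):
    assert isinstance(hex(n), str)

@pytest.mark.parametrize("s", ["", "some random string"])
def test_title_returns_a_string(s):
    assert isinstance(s.title(), str)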
So maybe the right question to ask is: what do you expect GPT-4 to do better than GPT-3, relative to the training distribution (which will have maybe 1–2 years more GitHub data) plus a larger context window? What’s the bottleneck? When would you say “I’m pretty sure there’s enough capacity to do it”? What are the few-shot examples you feed your model?
Testing in full generality is certainly AGI-complete (and a nice ingredient for recursive self-improvement!), but I think you’re overestimating the difficulty of pattern-matching your way to decent tests. Chess used to be considered AGI-complete too; I’d guess testing is more like poetry+arithmetic in that if you can handle context, style, and some details it comes out pretty nicely.
I expect GPT-4 to be substantially better at this ‘out of the box’, due to:
- the usual combination of larger models, better generalisation, scaling laws, etc.
- super-linear performance gains on arithmetic-like tasks due to generalisation, with spillover to code-related tasks
- the extra GitHub (and blog, etc.) data, which is probably pretty helpful given steady adoption since ~2015
Example outputs from Ghostwriter vs GPT-3:
$ hypothesis write gzip.compress

import gzip

from hypothesis import given, strategies as st


@given(compresslevel=st.just(9), data=st.nothing())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)
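A note on that output: st.nothing() is the placeholder the Ghostwriter falls back to when it can’t infer a strategy for an argument (gzip.compress has no type annotations), so a human still has to fill one in before the test generates any examples. A hand-filled version, as an illustration rather than actual Ghostwriter output, might look like:

import gzip

from hypothesis import given, strategies as st

# Hand-filled strategies: arbitrary byte strings, and the full range of
# compression levels that gzip.compress accepts.
@given(compresslevel=st.integers(min_value=0, max_value=9), data=st.binary())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)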
GPT-3, by contrast, tends to produce examples like the following (the first four I generated just now):
So it’s clearly ‘getting the right idea’, even without any fine-tuning at all, but not there yet. It’s also a lot worse at this without a natural-language description of the test we want in the prompt:
This document was written by a very smart AI programmer,
who is skilled in testing Python, friendly, and helpful.
Let's start by testing the two properties of sorting a list: the output
is a permutation of the input, and the elements of the output are
ordered:
@given(st.lists(st.integers()))
def test(xs):
    result = sorted(xs)
    assert set(result) == set(xs)
    assert is_in_order(result)
Our second test demonstrates a round-trip test. Given any string
of bytes, if you use gzip to compress and then decompress, the result
will be equal to the input:
Of course translating natural language into Python code is pretty cool too!
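As an aside (and purely as an illustration, not something GPT-3 produced), a self-contained version of the sorting-properties test sketched in that prompt would look something like this; note the multiset comparison (a plain set() check ignores duplicates) and the explicit ordering check standing in for the undefined is_in_order:

from collections import Counter

from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_sorted_properties(xs):
    result = sorted(xs)
    # The output is a permutation of the input (multiset equality).
    assert Counter(result) == Counter(xs)
    # The elements of the output are in non-decreasing order.
    assert all(a <= b for a, b in zip(result, result[1:]))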
I object to the characterization that it is “getting the right idea.” It seems to have latched on to “given a foo of bar” → “@given(foo.bar)” and that “assert” should be used, but the rest is word salad, not code.
It’s at least syntactically valid word salad composed of relevant words, which is a substantial advance—and per Gwern, I’m very cautious about generalising from “the first few results from this prompt are bad” to “GPT can’t X”.