Specifically, I’ve been trying to get GPT-3 to outperform the Hypothesis Ghostwriter in automatic generation of tests and specifications, without any success. I expect that GPT-4 will also underperform, but that it could outperform if fine-tuned on the problem.
If I knew where to get training data I’d like to try this with GPT-3, for that matter; I’m much more attached to the user experience of “hypothesis write mypackage generates good tests” than to any particular implementation (modulo installation and other manageable ops issues for novice users).
I think testing in general seems AGI-complete, in the sense that you need to understand the edge cases of a function (or what correct output looks like for “normal” input).
If we take the simplest testing case, say Python using pytest with typed code and some simple test input for each type (e.g. 0 and 1 for integers, empty/random strings, etc.), then you could show it examples of how to generate tests from function names… but then you could also just do that with a regex, or I guess with Hypothesis.
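For concreteness, a minimal sketch of the kind of template-generated test that describes, with stdlib functions standing in for whatever a signature scanner would pick up (the test names and inputs here are purely illustrative):

import pytest

# "Simple value per type" inputs: 0 and 1 for integers, an empty and a
# non-empty string. No model is needed to produce these.
@pytest.mark.parametrize("n", [0, 1])
def test_hex_returns_a_string(n):
    assert isinstance(hex(n), str)

@pytest.mark.parametrize("s", ["", "some random string"])
def test_title_returns_a_string(s):
    assert isinstance(s.title(), str)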
So maybe the right question to ask is: what do you expect GPT-4 to do better than GPT-3, relative to the training distribution (which will have maybe 1–2 years more GitHub data) plus a larger context window? What’s the bottleneck? When would you say “I’m pretty sure there’s enough capacity to do it”? What are the few-shot examples you feed your model?
Testing in full generality is certainly AGI-complete (and a nice ingredient for recursive self-improvement!), but I think you’re overestimating the difficulty of pattern-matching your way to decent tests. Chess used to be considered AGI-complete too; I’d guess testing is more like poetry+arithmetic in that if you can handle context, style, and some details it comes out pretty nicely.
I expect GPT-4 to be substantially better at this ‘out of the box’, due to:
- the usual combination of larger models, better generalisation, scaling laws, etc.
- super-linear performance gains on arithmetic-like tasks due to generalisation, with spillover to code-related tasks
- the extra GitHub (and blog, etc.) data, which is probably pretty helpful given steady adoption since ~2015
Example outputs from Ghostwriter vs GPT-3:
$ hypothesis write gzip.compress

import gzip

from hypothesis import given, strategies as st


@given(compresslevel=st.just(9), data=st.nothing())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)
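A note on that output: st.nothing() is the placeholder the Ghostwriter falls back to when it can’t infer a strategy for an argument (gzip.compress has no type annotations), so a human still has to fill one in before the test generates any examples. A hand-filled version, as an illustration rather than actual Ghostwriter output, might look like:

import gzip

from hypothesis import given, strategies as st

# Hand-filled strategies: arbitrary byte strings, and the full range of
# compression levels that gzip.compress accepts.
@given(compresslevel=st.integers(min_value=0, max_value=9), data=st.binary())
def test_roundtrip_compress_decompress(compresslevel, data):
    value0 = gzip.compress(data=data, compresslevel=compresslevel)
    value1 = gzip.decompress(data=value0)
    assert data == value1, (data, value1)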
GPT-3, by contrast, tends to produce examples like the following (the first four I generated just now):
So it’s clearly ‘getting the right idea’, even without any fine-tuning at all, but not there yet. It’s also a lot worse at this without a natural-language description of the test we want in the prompt:
This document was written by a very smart AI programmer,
who is skilled in testing Python, friendly, and helpful.
Let's start by testing the two properties of sorting a list: the output
is a permutation of the input, and the elements of the output are
ordered:
@given(st.lists(st.integers()))
def test(xs):
    result = sorted(xs)
    assert set(result) == set(xs)
    assert is_in_order(result)
Our second test demonstrates a round-trip test. Given any string
of bytes, if you use gzip to compress and then decompress, the result
will be equal to the input:
Of course translating natural language into Python code is pretty cool too!
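As an aside (and purely as an illustration, not something GPT-3 produced), a self-contained version of the sorting-properties test sketched in that prompt would look something like this; note the multiset comparison (a plain set() check ignores duplicates) and the explicit ordering check standing in for the undefined is_in_order:

from collections import Counter

from hypothesis import given, strategies as st


@given(st.lists(st.integers()))
def test_sorted_properties(xs):
    result = sorted(xs)
    # The output is a permutation of the input (multiset equality).
    assert Counter(result) == Counter(xs)
    # The elements of the output are in non-decreasing order.
    assert all(a <= b for a, b in zip(result, result[1:]))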
I object to the characterization that it is “getting the right idea.” It seems to have latched on to “given a foo of bar” → “@given(foo.bar)” and that “assert” should be used, but the rest is word salad, not code.
It’s at least syntactically valid word salad composed of relevant words, which is a substantial advance—and per Gwern, I’m very cautious about generalising from “the first few results from this prompt are bad” to “GPT can’t X”.