Why doesn’t it improve on AP English Literature and AP English Language?
I don’t have a good guess, but I found the AP English Language exam description with example questions and grading procedures if anyone wants to take a look.
How is it that bad at Codeforces? I competed a few years ago, and back then Div 2 A and B were extremely simple, basically just “implement the described algorithm in code.” If you submitted them quickly (which I’d expect GPT-4 to excel at), it was easy to reach a significantly better rating than the one reported in this paper.
I hope they didn’t make a mistake by misunderstanding the Codeforces rating system: after a contest, Codeforces only awards a fraction of the gap between your estimated performance rating and your current rating, but it is possible to calculate the rating equivalent of a given performance exactly from the data provided, if you know the details (which I’ve forgotten).
When I searched the paper for the exact methodology (by ctrl-f’ing “codeforces”), I didn’t find anything.
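To make the point about partial rating updates concrete, here is a toy sketch of the inversion described above. The update rule, the fraction k = 0.5, and the starting rating of 0 are all assumptions for illustration; the real Codeforces formula is more involved.

```python
# Toy Elo-style update: only a fraction k of the gap between the contest
# "performance rating" and the current rating is awarded after a contest.
def update(rating: float, performance: float, k: float = 0.5) -> float:
    return rating + k * (performance - rating)

# Inverting the update recovers the performance that produced an observed
# rating change, i.e. the "exact calculation" mentioned above.
def implied_performance(old_rating: float, new_rating: float, k: float = 0.5) -> float:
    return old_rating + (new_rating - old_rating) / k

# Example with hypothetical numbers: an account that starts at 0 and ends at
# 392 after a single contest would have an implied per-contest performance
# of 784 under this toy rule.
print(implied_performance(0, 392))  # 784.0
```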
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it’s a somewhat confusing chart.
I know. I skimmed the paper, and there is a table above the chart showing the results on the tasks for all models (since every model’s performance on Codeforces is below 5%, the bars overlap on the chart). I replied here because it seemed the most thematically appropriate place (it was asking about task performance); sorry if my choice of where to comment was confusing.
From the table:
GPT-3.5’s Codeforces rating is “260 (below 5%)”
GPT-4’s Codeforces rating is “392 (below 5%)”
Perhaps the model wasn’t allowed to read the sources for the free response section?
I think performance on AP English might be a quirk of how they dealt with dataset contamination. The English and Literature exams showed an anomalous amount of contamination (many of the famous texts are online and quoted elsewhere), so they threw out most of the questions, leading to a null conclusion about performance.
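For context on why the English exams in particular would get gutted, here is a minimal sketch of a substring-match contamination check, one common way this kind of filtering is done. The normalization, the 50-character span, and the three samples are assumptions for illustration, not necessarily the paper’s exact procedure.

```python
import random
import re

def normalize(text: str) -> str:
    """Strip everything except letters and digits, lowercased."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def is_contaminated(question: str, training_corpus: str,
                    n_samples: int = 3, span: int = 50) -> bool:
    """Flag a question if any randomly sampled span of it appears
    verbatim in the (normalized) training corpus."""
    q = normalize(question)
    corpus = normalize(training_corpus)
    if len(q) <= span:
        return q in corpus
    for _ in range(n_samples):
        start = random.randrange(len(q) - span + 1)
        if q[start:start + span] in corpus:
            return True
    return False

# A famous passage quoted in an AP English question will almost certainly
# appear somewhere in a web-scale corpus, so questions built around such
# passages get flagged and dropped, shrinking the evaluated set.
```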
Green bars are GPT-4. Blue bars are not. I suspect they just didn’t retest everything.
They did run the tests for all models, from Table 1:
(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)
It would be weird to include them if they didn’t run those tests. My read was that the green bars are the same height as the blue bars, so they’re hidden behind them.
Meaning it literally showed zero difference in half the tests? Does that make sense?
AP exams are scored on a scale of 1 to 5, so yes, getting the exact same score with zero difference makes sense.
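A tiny illustration of why coarse scoring produces exact ties: the cutoffs below are made up (not the College Board’s), but they show how a several-point gap in raw performance can disappear once mapped onto the 1 to 5 scale.

```python
def ap_score(raw_percent: float) -> int:
    """Map a raw percentage to an AP score using hypothetical cutoffs."""
    cutoffs = [(75, 5), (60, 4), (45, 3), (30, 2)]
    for cutoff, score in cutoffs:
        if raw_percent >= cutoff:
            return score
    return 1

# A ~5-point difference in raw performance vanishes after binning.
print(ap_score(52), ap_score(57))  # 3 3
```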
Roughly 1/3 of the tests, but yeah, that’s why I’m confused. It looks weird enough.