Aha! Thanks. (To save others a click: “change list” really just means “change” or “commit”: a single thing checked into version control or submitted for review.) I’m not sure the joke really lands for me—maybe I’m stupider than both GPT-3 and PaLM. It seems like the joke could be (1) the intern produced a hilariously excessive amount of code, perhaps because they e.g. failed to use elementary techniques like functions and loops for removing redundancy, or (2) the intern produced a normal amount of code but it was so bad that reading it was as painful as if it had been War-and-Peace-sized, or (3) the reviewer is incredibly lazy (so is telling a joke against himself) and finds reading even small amounts of other people’s code terribly hard work. Normally I’d use the obvious heuristic that the intended meaning is the one that’s funny, but unfortunately none of them seems very funny to me. I guess probably it’s #2?
(This is the difficulty about making up one’s own jokes for this sort of test...)
I am sure the situation is that the intern never pushed his code to VCS for a few months, just wrote it locally, and then pushed tons of code. The reviewer is dreading it because 1 day is a very small amount of time to review so much code.
Since it was the intern’s last day, they might have been less careful with their coding (or, depending on why they’re leaving, even added deliberate errors), so the reviewer will have to be extra thorough checking it.
Yeah, I do wonder how most of the example jokes not actually being very funny is affecting the results… It is also weird that they make an explicit reference to a term which is only used internally, and which presumably PaLM has little-to-no training on. Was that on purpose, or a slip-up by the authors?
The fact that we humans are having trouble understanding this joke does not bode well for its use as an AI benchmark…