If you take the distance between the North and South pole and divide it by ten million: voilà, you have a meter!
NB: The circumference of the Earth is ~40k km, so pole to pole is ~20k km. This definition of a meter should instead use the distance from the North (or South) pole to the Equator, which is ~10k km, i.e. ten million meters.
RE: GPT getting dumber, that paper is horrendous.
The code gen portion was completely thrown off by Markdown syntax (the authors mistook backticks for single quotes, afaict). I think the update to make there is that it is decent evidence that there was some RLHF on ChatGPT outputs. If you remember that “a human being will die if you don’t reply with pure JSON” tweet, even that final JSON was wrapped in Markdown. My modal guess is that Markdown was inserted via a kludge to make the ChatGPT UX better, and then RLHF was done on that kludged output. Code sections are also often labeled with the wrong language. My secondary guess is that the authors used an API which had this kludge added on top of it, such that GPT just wouldn’t output plaintext code, tho that guess is hard to square with the fact that there were any passing examples at all.
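To make the failure mode concrete, here is a minimal sketch of how a Markdown fence can sink a naive "is this directly executable" check. This is not the paper's code: the helper names `strip_markdown_fence` and `is_valid_python` are hypothetical, the sample response is made up, and I'm assuming the check boils down to parsing the raw response as Python.

```python
import ast
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

# Hypothetical model response: correct code wrapped in a Markdown fence.
RESPONSE = f"{FENCE}python\ndef add(a, b):\n    return a + b\n{FENCE}"

# Opening fence, optional language tag, newline, then the body up to the closing fence.
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def strip_markdown_fence(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    match = FENCE_RE.search(text)
    return match.group(1) if match else text


def is_valid_python(source: str) -> bool:
    """True if the source parses as Python (a stand-in for 'directly executable')."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


raw_ok = is_valid_python(RESPONSE)                             # fence breaks parsing
stripped_ok = is_valid_python(strip_markdown_fence(RESPONSE))  # body parses fine

print(f"raw response parses: {raw_ok}")
print(f"fence-stripped response parses: {stripped_ok}")
```

The same code flips from "fail" to "pass" depending only on whether the grader strips the fence, which is a formatting difference and says nothing about code quality.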
In the math portion they say GPT-4-0613 averaged only 3.8 CHARACTERS per response. Note that “[NO]” and “[YES]” both contain more than 3.8 characters. Note that GPT-4 answers hardly any queries with a single word. Note that the paper’s example answer for the primality question included 1000 characters, so the remaining questions apparently averaged 3 characters flat. Even if you think they only fucked up that data analysis: I also replicated GPT-4 failing to solve “large” number primality, and am close to calling that a cherry-picked example. It is a legit difficult problem for GPT, and I agree that anyone who goes to ChatGPT to replicate will find the answer they get back is a coin flip at best. But we need to say it again for the kids in the back: the claim is that GPT-4 got 2% on yes/no questions. What do we call a process that gets 2% on coin-flip questions?
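The numbers above are easy to sanity check. The sketch below takes the 3.8-character average and the 2% accuracy from the discussion, but the question count `N_QUESTIONS = 500` is my placeholder and not a figure I'm quoting from the paper, so swap in the real one. It checks whether the reported average is even reachable with bracketed answers, and how likely a true coin flipper is to score that low.

```python
from math import comb

# Assumption for illustration only: the question count is a placeholder,
# not a figure taken from the paper.
N_QUESTIONS = 500          # hypothetical number of yes/no questions
REPORTED_ACCURACY = 0.02   # the 2% figure discussed above
REPORTED_MEAN_CHARS = 3.8  # the reported average response length

# Even the tersest bracketed answers are longer than the reported average.
shortest_answer = min(len("[NO]"), len("[YES]"))
mean_is_impossible = REPORTED_MEAN_CHARS < shortest_answer


def prob_at_most(k: int, n: int, p: float = 0.5) -> float:
    """P(X <= k) for X ~ Binomial(n, p): chance of k or fewer correct answers."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


n_correct = round(REPORTED_ACCURACY * N_QUESTIONS)
p_by_chance = prob_at_most(n_correct, N_QUESTIONS)

# Flipping every answer of a 2% yes/no classifier gives a 98% one.
inverted_accuracy = 1 - REPORTED_ACCURACY

print(f"mean below shortest answer: {mean_is_impossible}")
print(f"P(<= {n_correct}/{N_QUESTIONS} correct by coin flip) = {p_by_chance:.3e}")
print(f"accuracy after inverting answers: {inverted_accuracy:.0%}")
```

Under these assumptions a fair coin essentially never scores that low, so 2% points to something systematic (in the model's answers or in the grading) rather than to random guessing.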