Timothy Bates: The more things change, the more they stay the same: a 1943 paper shows that a mechanical prediction of admissions greatly out-predicts the decisions of administrators asked to add their subjective judgement :-( (excellent talk from Nathan Kuncel!)
Nick Brown: I would bet that if you asked those subjective evaluators, they would say “We know the grades are the best predictor on average, but ‘sometimes’ they don’t tell the whole story”. People want to double-dip: Use the method most of the time, but add their own “special expertise”.
Timothy Bates: Nathan Kuncel put it astutely, showing that decision makers’ beta weights are pretty accurate, but then they ruin their decision at “run time” by adding random intuitions about details in the application :-)
I don’t like that this reasoning is based on the correlation between rating and GPA, because I think GPA is goodharted. It is right not to select the admission process based on its correlation with GPA. I think maybe the correlation gap would appear even if the humans were merely adding white noise.
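A quick way to see the white-noise point is to simulate it. The sketch below is a toy simulation with invented numbers, not anything from Kuncel’s data: a mechanical score predicts an outcome, and a “human” rating is just that score plus unrelated noise, yet its correlation with the outcome drops.

```python
# Toy check of the white-noise point above: adding pure noise to the mechanical
# score lowers its correlation with the outcome (here standing in for GPA),
# even though nothing systematically biased was added. Purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
mechanical = rng.normal(0, 1, n)                # mechanical composite score
gpa = 0.6 * mechanical + rng.normal(0, 0.8, n)  # outcome partly predicted by it
human = mechanical + rng.normal(0, 1.0, n)      # same score plus white noise

corr = lambda a, b: np.corrcoef(a, b)[0, 1]
print("mechanical vs GPA:  ", corr(mechanical, gpa))  # ~0.6
print("human rating vs GPA:", corr(human, gpa))       # noticeably lower, ~0.42
```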
I joke about calling them ‘Openly Evil AI,’ but developing a 99.9% effective watermarking tool and then sitting on it because people identifying your outputs would be bad for business? Yeah, that’s something.
Maybe if you solve for equilibrium you get that after releasing the tool, the tool is defeated reasonably quickly?
[JSON formatting in GPT]
How did they do it?
> OpenAI: While sampling, after every token, our inference engine will determine which tokens are valid to be produced next based on the previously generated tokens and the rules within the grammar that indicate which tokens are valid next. We then use this list of tokens to mask the next sampling step, which effectively lowers the probability of invalid tokens to 0. Because we have preprocessed the schema, we can use a cached data structure to do this efficiently, with minimal latency overhead.
Throw out all invalid outputs, and all the outputs that remain will be valid. Nice.
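For concreteness, here is a minimal sketch of that token-masking idea, assuming a toy vocabulary and a hand-written `valid_next_tokens` grammar check; none of this is OpenAI’s actual inference code, and the real system precompiles the schema into a cached data structure instead of a Python function.

```python
# Minimal sketch of grammar-constrained sampling via logit masking.
# Toy vocabulary and grammar: a flat JSON object like {"key":"value"}.
import numpy as np

VOCAB = ["{", "}", '"key"', ":", '"value"', ",", "<eos>"]

def valid_next_tokens(prefix):
    """Return the set of tokens the toy grammar allows next."""
    if not prefix:
        return {"{"}
    last = prefix[-1]
    if last == "{":
        return {'"key"', "}"}
    if last == '"key"':
        return {":"}
    if last == ":":
        return {'"value"'}
    if last == '"value"':
        return {",", "}"}
    if last == ",":
        return {'"key"'}
    return {"<eos>"}          # after the closing brace, only end-of-sequence

def sample_constrained(logits_fn, max_len=12, seed=0):
    # max_len is a crude cutoff for the toy; a real engine would also force
    # the grammar to terminate before running out of budget.
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(max_len):
        logits = logits_fn(out)                        # model's scores for every token
        allowed = valid_next_tokens(out)
        mask = np.array([0.0 if t in allowed else -np.inf for t in VOCAB])
        probs = np.exp(logits + mask)                  # invalid tokens get probability 0
        probs /= probs.sum()
        tok = VOCAB[rng.choice(len(VOCAB), p=probs)]
        if tok == "<eos>":
            break
        out.append(tok)
    return "".join(out)

# Stand-in for the model: uniform logits, so only the grammar mask shapes the output.
print(sample_constrained(lambda prefix: np.zeros(len(VOCAB))))
```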
This is obvious. Why wasn’t it available already? I guess bandwidth is what it is.
In the “Agent Performance vs Humans with Time Limits (95% CI)” figure, the 95% CI bars look fishy: they are large relative to how regular the point estimates are. The bars smoothly match the relative ranking I already expected from the models; if there were independent fluctuations of that size, the results would be jagged compared to the ranking implied by the general model quality we already know. Possible explanations:
- They made a mistake and the bars are based on the population standard deviation rather than the standard error. In that case the actual bars would be smaller.
- The LLM scores are correlated across models: the randomness comes from the tasks, not from the models, so there is uncertainty from the small number of tasks, but the inter-LLM comparisons are fairly precise. In that case the bars relevant for comparing LLMs to each other would be smaller, while the bars as shown are fine (probably) for comparing LLMs to humans (a toy simulation below illustrates this).
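Here is the toy simulation for that second explanation: two models scored on the same small set of tasks share the task-difficulty component, so their per-model standard errors are wide while the paired comparison between them is much tighter. All numbers are invented for illustration.

```python
# Correlated-scores illustration: shared task difficulty inflates per-model
# error bars but mostly cancels when comparing two models on the same tasks.
import numpy as np

rng = np.random.default_rng(0)
n_tasks = 30                                   # small task suite
task_difficulty = rng.normal(0, 1, n_tasks)    # shared component: same tasks for every model
noise_a = rng.normal(0, 0.3, n_tasks)          # model-specific noise, much smaller
noise_b = rng.normal(0, 0.3, n_tasks)

score_a = 0.5 + task_difficulty + noise_a      # per-task scores of model A
score_b = 0.3 + task_difficulty + noise_b      # model B is slightly worse on average

se = lambda x: x.std(ddof=1) / np.sqrt(len(x))
print("per-model SE (what wide CI bars reflect):", se(score_a), se(score_b))
print("naive SE of difference (independence):   ", np.hypot(se(score_a), se(score_b)))
print("paired SE of difference (correct here):  ", se(score_a - score_b))
```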
I believe it’s already known that running the text through another (possibly smaller and cheaper) LLM to reword it can remove the watermarking. So for catching cheaters it’s only a tiny bit stronger than searching for “as a large language model” in the text.