ryan_greenblatt comments on LLM Generality is a Timeline Crux

ryan_greenblatt 25 Jun 2024 21:13 UTC
4 points
2
I also think this is plausible. Note that randomly selected examples from the public evaluation set are often considerably harder than those in the train set, for which there is a known MTurk baseline (an average of 84%).