Spicy take: good evals for automated ML R&D should (also) cover what’s in the attached picture (and try hard at elicitation in this rough shape). AFAIK, the main (public) proposals didn’t seem to when I last looked. Picture from https://x.com/RobertTLange/status/1829104918214447216.