Marius Hobbhahn comments on We need a Science of Evals

Marius Hobbhahn 23 Jan 2024 9:20 UTC
Not quite sure tbh.
1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant.
2. I’m not sure how true your suggestion is, but I also haven’t tested it much empirically. This is exactly the kind of thing I’d like to have some sort of scaling law or rule for. It points at something I feel we don’t have enough confidence in yet, or at least something that hasn’t been established as a standard in evals.