Not quite sure tbh. 1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant. 2. I’m not sure how true your suggestion is but I haven’t tried it a lot empirically. But this is exactly the kind of stuff I’d like to have some sort of scaling law or rule for. It points exactly at the kind of stuff I feel like we don’t have enough confidence in. Or at least it hasn’t been established as a standard in evals.
Not quite sure tbh.
1. I guess there is a difference between capability evaluations with prompting and with fine-tuning, e.g. you might be able to use an API for prompting but not fine-tuning. Getting some intuition for how hard users will find it to elicit some behavior through the API seems relevant.
2. I’m not sure how true your suggestion is but I haven’t tried it a lot empirically. But this is exactly the kind of stuff I’d like to have some sort of scaling law or rule for. It points exactly at the kind of stuff I feel like we don’t have enough confidence in. Or at least it hasn’t been established as a standard in evals.