I found this fairly helpful for thinking about evals (I hadn’t previously thought that hard about how to evaluate an eval, and at first blush this seems like a pretty reasonable framework for doing that).
I had previously thought “evals seem to be a thing you do if you don’t have much traction on harder things”, so I found this point interesting: “yep, it’s easier to get started on evals, but that doesn’t mean it’s necessarily easy to push them all the way through towards ‘industrial-useful’”.