Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 10 Dec 2024 18:39 UTC
I recently implemented some reasoning evaluations using UK AISI’s inspect framework, partly as a learning exercise, and partly to create something which I’ll probably use again in my research.

Code here: https://github.com/dtch1997/reasoning-bench

My takeaways so far:
- Inspect is a really good framework for doing evaluations
- When using Inspect, the scorer has to be defined with some care, or it will behave in unhelpful ways. For example, the match scorer only looks for the target at the end of the output by default (pass location='any' to match anywhere in the output instead).
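
To make the scorer point concrete, here is a minimal plain-Python sketch of the difference between matching at the end and matching anywhere. `simple_match` is a hypothetical toy stand-in written for illustration, not Inspect's actual implementation (the real match scorer also does some normalisation that this sketch skips), and the example completion is made up.

```python
def simple_match(output: str, target: str, location: str = "end") -> bool:
    """Toy stand-in for a match-style scorer (hypothetical, not Inspect's code)."""
    output, target = output.strip(), target.strip()
    if location == "end":
        # Default behaviour described above: target must be the final text.
        return output.endswith(target)
    if location == "begin":
        return output.startswith(target)
    if location == "any":
        # Target may appear anywhere in the output.
        return target in output
    if location == "exact":
        return output == target
    raise ValueError(f"unknown location: {location!r}")


# A made-up completion where the answer is not the last thing the model says.
completion = "The answer is 42. Let me double-check: 6 * 7 does give that."

end_result = simple_match(completion, "42")                  # default: end only
any_result = simple_match(completion, "42", location="any")  # anywhere in output
print(end_result, any_result)
```

With the default, a correct answer followed by any trailing explanation is marked wrong. Matching anywhere fixes that, but plain substring matching like this can also produce false positives (a target of "4" would match "42"), so it is probably worth spot-checking scored samples either way.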