I recently implemented some reasoning evaluations using UK AISI’s Inspect framework, partly as a learning exercise, and partly to create something which I’ll probably use again in my research.
Code here: https://github.com/dtch1997/reasoning-bench
My takeaways so far:
- Inspect is a really good framework for doing evaluations
- When using Inspect, some care has to be taken when defining the scorer in order for it not to be dumb, e.g. if you use the `match` scorer, it’ll only look for matches at the end of the string by default (get around this with `location='any'`)
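
To make the scorer pitfall concrete, here is a minimal pure-Python sketch of end-of-string matching versus match-anywhere. The `simple_match` function and the sample completion are hypothetical stand-ins written for illustration. This is not Inspect's actual implementation, just a simplified version of the behaviour described above.

```python
def simple_match(completion: str, target: str, location: str = "end") -> bool:
    """Toy string-match scorer (illustrative only, not Inspect's code)."""
    # Normalise whitespace and case so the comparison is less brittle.
    text = completion.strip().lower()
    goal = target.strip().lower()

    if location == "end":
        # Default-style behaviour: the target must be the last thing in the output.
        return text.endswith(goal)
    if location == "begin":
        return text.startswith(goal)
    if location == "any":
        # Relaxed behaviour: the target may appear anywhere in the output.
        return goal in text
    raise ValueError(f"unknown location: {location!r}")


# A hypothetical model output that gives the right answer, then keeps talking.
completion = "The answer is 42 because six times seven is forty-two."

end_result = simple_match(completion, "42", location="end")
any_result = simple_match(completion, "42", location="any")

print("end:", end_result)
print("any:", any_result)
```

With end-of-string matching, the correct answer is marked wrong simply because the model kept explaining after stating it. In Inspect itself the equivalent change is passing `location='any'` to the `match` scorer. Note that matching anywhere is more permissive, so a plain substring check like the one in this sketch would also accept "42" inside "142".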