Joseph Bloom comments on Testbed evals: evaluating AI safety even when it can’t be directly measured