I expect you’d instead need to tune the base model to elicit relevant capabilities first. So instead of evaluating a tuned model intended for deployment (which can refuse to display some capabilities), or a base model (which can have difficulties with displaying some capabilities), you need to tune the model to be more purely helpful, possibly in a way specific to the tasks it’s to be evaluated on.
I expect you’d instead need to tune the base model to elicit relevant capabilities first. So instead of evaluating a tuned model intended for deployment (which can refuse to display some capabilities), or a base model (which can have difficulties with displaying some capabilities), you need to tune the model to be more purely helpful, possibly in a way specific to the tasks it’s to be evaluated on.