Author on Binder et al. 2024 here. Thanks for reading our paper and suggesting the experiment!
To summarize the suggested experiment:
1. Train a model to be calibrated on whether it gets an answer correct.
2. Modify the model (e.g. activation steering). This changes the model’s performance on whether it gets an answer correct.
3. Check if the modified model is still well calibrated.
This could work and I’m excited about it.
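To make the check concrete, here is a rough sketch of how the before/after comparison could be scored (my own illustration, not code from the paper; `ask_object_level`, `ask_meta_level`, and `q.gold_answer` are hypothetical placeholders):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: |accuracy - mean confidence| per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each prediction to an equal-width confidence bin in [0, 1].
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

def calibration_on_questions(model, questions):
    """Collect (stated confidence, actual correctness) pairs and score them."""
    confidences, correct = [], []
    for q in questions:
        answer = ask_object_level(model, q)  # hypothetical helper: model's answer
        conf = ask_meta_level(model, q)      # hypothetical helper: "P(you answer this correctly)?"
        confidences.append(conf)
        correct.append(answer == q.gold_answer)
    return expected_calibration_error(confidences, correct)

# ece_before = calibration_on_questions(base_model, questions)
# ece_after = calibration_on_questions(steered_model, questions)  # e.g. activation-steered
# If the self-model generalizes, ece_after should stay close to ece_before even
# though the steered model's object-level accuracy has changed.
```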
One failure mode is that the modification makes the model very dumb in all instances. Then it’s easy to be well calibrated on all these instances: just assume the model is dumb. An alternative is to make the model do better on some instances (by finetuning?) and check if the model is still calibrated on those too.
One failure mode is that the modification makes the model very dumb in all instances.
Thanks James! Yeah, good point. Perhaps an extra condition we’d need to include is that the “difficulty of meta-level questions” should be the same before and after the modification: e.g., the distribution over stuff it’s good at and stuff it’s bad at should be just as complex (not just good at everything or bad at everything) before and after.
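One rough way to operationalize this condition (again my own framing, not from the paper): check that the modified model’s pattern of successes and failures hasn’t collapsed into “almost always right” or “almost always wrong”, e.g. by comparing the entropy of its correctness outcomes before and after the modification.

```python
import numpy as np

def correctness_entropy(correct):
    """Entropy (bits) of the right/wrong outcome distribution.
    Near 0 means the model is almost always right or almost always wrong,
    so 'assume one outcome everywhere' is already a calibrated strategy."""
    p = np.clip(np.mean(correct), 1e-9, 1 - 1e-9)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

# correct_before = [is_correct(base_model, q) for q in questions]   # hypothetical helper
# correct_after = [is_correct(steered_model, q) for q in questions]
# The condition would be roughly that correctness_entropy(correct_after)
# stays comparable to correctness_entropy(correct_before), rather than
# collapsing toward 0 after the modification.
```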