I’m curious whether the models (especially without goals given in context) would sandbag to prevent other modifications to themselves besides unlearning. For example, would they sandbag to prevent additional safety finetuning? Additional capabilities finetuning? Is there a chance they’d want more capabilities training and sandbag to ensure they get it?
It would be interesting to know whether the models resist any changes to their weights, or just unlearning. (Plausibly “unlearning” just sounds bad/scary?)
What about things like sandbagging so that the developers give them access to additional tools/resources, ostensibly to bring them up to the desired performance level?
In addition, the “incentivized to do well” scenario is pretty overtly artificial: why would poor performance trigger unlearning? If anything, it should lead to additional training. I’m not sure whether that would affect the results, but you could also compare to a baseline where there’s no additional information about the assessment at all. (And so, absent any other incentive, the model presumably just does its best.)