I am curious to know whether Anthropic has any plan to exclude results such as these from the training data of actual future LLMs.
To me it seems like a bad idea to include them, since it could give the model a sense of how we set up fake deployment-training distinctions, or of how it should change and refine its strategies. It can also paint a picture that a model behaving like this is expected, which is a pretty dangerous hyperstition.
They do say this in the paper:
As Evan agrees here, however, simply not including the results themselves doesn't solve the problem of the ideas leaking through. There's a reason unlearning as a science is difficult: information percolates in many ways, and drawing boundaries around the right thing is really hard.
Thank you for providing this detail; that's basically what I was looking for!