Treacherous turn test:
To be clear, in this test we wouldn’t do any gradient updates to make the model perform the treacherous turn; rather, we would be hoping that the prompt alone makes it do so?