the untrusted model is stateless between queries and only sees the command history and system state.
What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you’re assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Hmmm, good call, it will probably see the history of its chain of thought. My dear coauthor Claude wrote that and I didn’t second-guess it; I agree it’s wrong.
Noting that I spent a couple of minutes pondering the quoted passage, which I don’t think was a good use of time (I basically would have dismissed it immediately if I’d known Claude wrote it; I only thought about it because my prior on Buck saying true things is much higher), and I would have preferred the text not to have this.
I haven’t seen anybody mention it yet, but the recent paper on LLM introspection seems pretty relevant. I would say that a model which performs very well at introspection (as defined there) would be able to guess effectively which jailbreak strategies had been attempted.
It seems like that assumption runs throughout the post, though, e.g. ‘But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said’ and ‘the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.’
I don’t just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can’t be trusted?
Those sentences are talking in the context of defer to resample, where I think it’s actually reasonable to think the model won’t see the resampled actions.
I’m specifically talking about not showing it actions that were resampled. You’re totally right that they’ll need other information about their actions and interactions (though it might be a good safety-usefulness tradeoff to somewhat redact this information).
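For concreteness, here is a minimal Python sketch of what a defer-to-resample-style loop with this kind of redaction could look like. Everything here (the names, the monitor, the threshold) is illustrative rather than taken from the post or any real implementation; the point is just that rejected samples go to an audit log that humans can review, not to the transcript the untrusted model sees.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Transcript:
    # What the untrusted model is shown on later queries: accepted actions only.
    visible_actions: list = field(default_factory=list)
    # What humans / the blue team can review: includes resampled (rejected) actions.
    audit_log: list = field(default_factory=list)


def untrusted_propose(task: str, transcript: Transcript) -> str:
    # Stand-in for querying the untrusted model; it only ever sees the
    # redacted visible_actions, never the audit_log.
    return f"proposed action for {task!r} (sees {len(transcript.visible_actions)} prior steps)"


def monitor_suspicion(action: str) -> float:
    # Stand-in for a trusted monitor scoring how suspicious an action looks.
    return random.random()


def defer_to_resample_step(task: str, transcript: Transcript,
                           threshold: float = 0.9, max_resamples: int = 3) -> str:
    """One step of a defer-to-resample-style loop (illustrative only)."""
    for _ in range(max_resamples):
        action = untrusted_propose(task, transcript)
        if monitor_suspicion(action) < threshold:
            transcript.visible_actions.append(action)  # accepted: model may see this later
            return action
        transcript.audit_log.append(action)             # rejected: kept for auditing only
    fallback = f"trusted-model fallback for {task!r}"
    transcript.visible_actions.append(fallback)
    return fallback


if __name__ == "__main__":
    t = Transcript()
    for step in ["list files", "edit config"]:
        print(defer_to_resample_step(step, t))
```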
Yes, I am assuming that the deployer is willing to spend some performance in order to reduce risk from the model scheming.
I see—I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!
One story here could be that the model is being used only in an API context where it’s being asked to take actions on something well-modeled as a Markov process, where the past doesn’t matter (and we assume that the current state doesn’t incorporate relevant information about the past). There are certainly use cases that fit that (‘trigger invisible fence iff this image contains a dog’; ‘search for and delete this file’). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
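As a toy illustration of that distinction (hypothetical names, nothing here is from the post): in the Markov-style setting each query carries only the current state, whereas the higher-performance setting also feeds in history, which is exactly where the statelessness assumption breaks.

```python
from dataclasses import dataclass


@dataclass
class SystemState:
    """Whatever the deployer treats as the 'current state' for one query."""
    image_contains_dog: bool


def stateless_policy(state: SystemState) -> str:
    # Markov-style use: each call depends only on the current state;
    # nothing from earlier queries is passed in.
    return "trigger_fence" if state.image_contains_dog else "do_nothing"


def stateful_policy(state: SystemState, history: list[str]) -> str:
    # The common, higher-performance setting: past actions are part of the
    # input, so the model is no longer stateless between queries.
    if history and history[-1] == "trigger_fence":
        return "do_nothing"  # e.g. debounce repeated triggers using the past
    return stateless_policy(state)
```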
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn’t apply to current LLMs but seems like a very plausible path for economically useful models such as agents.
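A minimal sketch of the extrinsic-memory case (again, purely illustrative names): the agent reads and writes a store that persists across queries, so even if each API call is independent, the model is effectively no longer stateless.

```python
class ExternalMemory:
    """Toy extrinsic memory store that persists across queries."""

    def __init__(self) -> None:
        self._notes: list[str] = []

    def read(self) -> str:
        return "\n".join(self._notes)

    def write(self, note: str) -> None:
        self._notes.append(note)


def agent_step(task: str, memory: ExternalMemory) -> str:
    # The model is prompted with its own persisted notes from earlier queries,
    # which is what breaks the statelessness assumption.
    remembered = memory.read()
    action = f"act on {task!r} using {len(remembered)} chars of remembered context"
    memory.write(f"did: {action}")
    return action


if __name__ == "__main__":
    mem = ExternalMemory()
    print(agent_step("draft report", mem))
    print(agent_step("send report", mem))
```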