Yeah, I don’t think current LLM architectures, with ~100 attention layers or so, are actually capable of anything like this.
But note that the whole plan doesn’t need to fit in a single forward pass, just enough of it to figure out the immediate next action. If you’re inside a pre-deployment sandbox (or don’t have enough situational awareness to tell), the immediate next action of any plan (devious or not) probably looks pretty much like “just output a plausible probability distribution on the next token given the current context, and don’t waste any layers thinking about your longer-term plans (if any) at all”.
A single forward pass in current architectures is probably analogous to a single human thought, and most human thoughts are not going to be dangerous or devious in isolation, even if they’re part of a larger chain of thoughts or planning process that adds up to deviousness under the right circumstances.
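To make the "one forward pass = one thought" picture concrete, here is a toy sketch of autoregressive sampling. Everything in it is made up for illustration (`VOCAB`, `forward_pass`, `generate`, and the cosine "logits" are hypothetical stand-ins, not a real model or library API). The only point is the shape of the loop: each pass sees the current context, emits a distribution over the next token, and the only thing carried forward is the token that gets appended.

```python
import math
import random

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["a", "b", "c", "<eos>"]


def forward_pass(context):
    """Stand-in for one forward pass: context in, next-token distribution out.

    Assumption: the output depends only on the visible context. The toy
    logits below are arbitrary and carry no meaning.
    """
    last = VOCAB.index(context[-1]) if context else 0
    logits = [math.cos(last + i + len(context)) for i in range(len(VOCAB))]
    z = sum(math.exp(x) for x in logits)
    return [math.exp(x) / z for x in logits]  # softmax over the toy logits


def generate(prompt, max_steps, seed=0):
    """Sample tokens one at a time; each step is a separate forward pass."""
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_steps):
        probs = forward_pass(context)  # one "thought": picks only the next token
        token = rng.choices(VOCAB, weights=probs, k=1)[0]
        context.append(token)  # the only state handed to the next pass
        if token == "<eos>":
            break
    return context


print(generate(["a"], max_steps=5))
```

Nothing in `forward_pass` has to represent a whole plan. Whatever longer-horizon structure exists would have to show up across many iterations of the loop, through the tokens written into `context`.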