Max H comments on I think I’m just confused. Once a model exists, how do you “red-team” it to see whether it’s safe. Isn’t it already dangerous?

Max H 19 Nov 2023 17:09 UTC
4 points
Yeah, I don’t think current LLM architectures, with ~100s of attention layers or whatever, are actually capable of anything like this.

But note that the whole plan doesn’t necessarily need to fit in a single forward pass, just enough of it to figure out what the immediate next action is. If you’re inside a pre-deployment sandbox (or don’t have enough situational awareness to tell whether you are), the immediate next action of any plan (devious or not) probably looks pretty much like “just output a plausible probability distribution on the next token given the current context, and don’t waste any layers thinking about your longer-term plans (if any) at all”.
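To make the "one forward pass = one next-token distribution" point concrete, here is a toy sketch in plain Python. Everything in it is made up for illustration: `VOCAB`, `forward_pass`, and `generate` are hypothetical names, and the scoring rule is a stand-in for a real network, not how any actual LLM computes logits. The only thing it is meant to show is the shape of the loop: each call sees the current context and returns a distribution, and anything that looks like a longer sequence of actions only exists across many calls.

```python
import math

# Hypothetical toy vocabulary; a real model has tens of thousands of tokens.
VOCAB = ["the", "cat", "sat", "<eos>"]


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_pass(context):
    """One 'thought': map the current context to next-token probabilities.

    Assumption: this made-up rule simply favors the vocabulary entry that
    follows the last token in the context. It stands in for a real network.
    """
    last = context[-1] if context else None
    nxt = (VOCAB.index(last) + 1) % len(VOCAB) if last in VOCAB else 0
    logits = [2.0 if i == nxt else 0.0 for i in range(len(VOCAB))]
    return softmax(logits)


def generate(context, max_steps=10):
    """Autoregressive loop: no single call contains the whole sequence."""
    context = list(context)
    for _ in range(max_steps):
        probs = forward_pass(context)           # one forward pass
        token = VOCAB[probs.index(max(probs))]  # greedy choice of next token
        context.append(token)                   # the only state carried forward
        if token == "<eos>":
            break
    return context


print(generate(["the"]))  # ['the', 'cat', 'sat', '<eos>']
```

Inspecting any one `forward_pass` call in isolation only ever shows a distribution over the next token; the full output is a property of the loop, not of any individual step.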

A single forward pass in current architectures is probably analogous to a single human thought, and most human thoughts are not going to be dangerous or devious in isolation, even if they’re part of a larger chain of thoughts or planning process that adds up to deviousness under the right circumstances.