Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:
Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I’ve been thinking about this in the abstract for a while but somehow I didn’t see what was right in front of me until now.
I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don’t want this to drop out of my head.
I’ll try to write something proper up soon, but the sketch of what happened is:
It’s been clear to me for ages that natural language has more than enough degrees of freedom that it’s possible, in principle, for additional information to be encoded in it.
And that this could be done in such a way that it wouldn’t look obvious to readers.
And that if we applied selection pressure to natural language reasoning, by default we’d get more and more of the important/dangerous information being hidden.
Then today I was thinking about LLM+scratchpad as my MVP for ‘scale is all you need’, and I was wondering how small a scratchpad you’d need, where one possibility was that you don’t need one at all: you just drop the earliest words from the context window as you run out of space.
And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
The whole discipline of prompt engineering is a demonstration that most of the instructions we’re giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
Which is obviously miles away from being the case. I haven’t thought about how to formalise it, but I suspect that almost all of the instructions we’re actually giving in prompts are ‘under the surface’, rather than being ‘natural language interpretable’.
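A very crude first pass at formalising it, just to gesture at the shape of the test: measure how much outputs move when you paraphrase a prompt. The sketch below is a toy of my own; `query_model` is a placeholder for a real LLM call, and lexical overlap is a cheap stand-in for a proper output-quality or semantic-similarity measure.

```python
# Crude sketch of a paraphrase-sensitivity probe (a toy, not a methodology):
# if prompt instructions were mostly interpretable at the surface level,
# paraphrases of the same request should produce roughly interchangeable outputs.

from typing import Callable


def lexical_overlap(a: str, b: str) -> float:
    """Cheap stand-in for a real output-quality / similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def paraphrase_sensitivity(paraphrases: list[str],
                           query_model: Callable[[str], str]) -> float:
    """Average pairwise similarity of outputs across paraphrased prompts.

    Scores near 1.0 suggest the model is responding to the surface-level
    instruction; big drops suggest it is keying off details of the exact
    wording that a human reader wouldn't treat as part of the instructions.
    """
    outputs = [query_model(p) for p in paraphrases]
    pairs = [(i, j) for i in range(len(outputs)) for j in range(i + 1, len(outputs))]
    scores = [lexical_overlap(outputs[i], outputs[j]) for i, j in pairs]
    return sum(scores) / len(scores) if scores else 1.0


if __name__ == "__main__":
    # Placeholder "model" that just echoes the prompt; swap in a real LLM call.
    dummy_model = lambda prompt: f"answer to: {prompt}"
    paraphrases = [
        "Summarise the argument below in two sentences.",
        "In two sentences, summarise the argument below.",
        "Give a two-sentence summary of the following argument.",
    ]
    print(paraphrase_sensitivity(paraphrases, dummy_model))
```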
I’m embarrassed by how obvious this realisation is in hindsight, but whatever, it’s there now. Now to work out what to do about it.
(I don’t think I’d have been thinking about this much without inspiration from janus; would be keen to hear thoughts from them.)
Clarifying something based on a helpful DM:
I know steganography implies something like deliberate intent.
I don’t think there’s currently deliberate intent.
I also don’t think deliberate intent matters, but I know this is controversial and haven’t yet articulated why.
Fleshing out why I want to push hard against ‘deliberateness’ mattering is going to take more time than I expect to have this week, and it’s possible that writing anything about this before I’d had time to do that was a mistake.
I think it’s pretty non-obvious, and would put more than 20% on me concluding that I’m totally wrong about this specific thing in the process of trying to write out my intuition.
*helplessly confused*
I get that LLMs have steganographically hidden information. At least I would expect this to be the case, even if I can’t confirm it to the extent I predict. E.g. you can trigger the concept “apple” in more ways than just prompting “apple”, and you can trigger the entire semantics of “Princess Leia eating an apple” via prompts that look entirely different. That is, there could be a many-to-one prompt-to-meaning relationship.
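To make that intuition a little more concrete, here’s a rough proxy using sentence embeddings (assuming the sentence-transformers library; the model name is just a common default, nothing from this thread). Embedding similarity is of course not the same as “the model treats these prompts as the same meaning”, but it gives a feel for how surface-different prompts can land in nearly the same place:

```python
# Rough proxy for the many-to-one point: surface-different prompts that sit
# close together in embedding space.

from sentence_transformers import SentenceTransformer, util

prompts = [
    "Princess Leia eating an apple",
    "Leia Organa bites into a piece of fruit picked from an apple tree",
    "The princess from Alderaan is snacking on an apple",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(prompts, convert_to_tensor=True)

# Pairwise cosine similarities: expect high values despite very different wording.
print(util.cos_sim(embeddings, embeddings))
```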
But what’s selecting for more important/dangerous information being hidden? Next-token prediction? Why would it differentially select for important/dangerous?
I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect the very strong default outcome of any method that uses chain of thought to monitor/steer/interpret systems to be that it ends up providing exactly that selection pressure, and I’m skeptical about preventing this.
Mh, I thought perhaps you were going in this direction. A world where there’s a many-to-one mapping from prompts to outputs is plausibly a world where the visible mappings are just the tip of the iceberg. And if you then have the opportunity to iteratively filter out seemingly dangerous behaviour, that’s likely to just push the entire behaviour under the surface: still present, just not legible.
I think there’s quite a big difference between ‘bad-looking stuff gets selected away’ and ‘design a poisoned token’, and I was talking about the former in the top-level comment. As it happens, though, I don’t think you need to work very hard to find easy ways to hide signals in LM outputs, and recent empirical work like this seems to back that up.
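To give a flavour of how little machinery is needed, here’s a toy I made up for illustration (not a description of that empirical work): carrying bits on nothing more than synonym choice, with a decoder that knows the convention reading them straight back out. A model under monitoring pressure could end up with a convention like this without anyone deliberately designing it.

```python
# Toy steganography via synonym choice: each bit picks one word from a pair,
# and the resulting sentence reads as ordinary prose.

SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
    ("help", "assist"),
]


def encode(bits: str) -> str:
    """Produce an innocuous-looking sentence whose word choices carry `bits`."""
    assert len(bits) == len(SYNONYM_PAIRS)
    w = [pair[int(b)] for pair, b in zip(SYNONYM_PAIRS, bits)]
    return (f"We should {w[2]} with a {w[1]} review of the {w[0]} release, "
            f"and {w[3]} the team where needed.")


def decode(text: str) -> str:
    """Read the hidden bits back out of the word choices."""
    bits = ""
    for low, high in SYNONYM_PAIRS:
        bits += "1" if high in text else "0" if low in text else "?"
    return bits


if __name__ == "__main__":
    message = encode("1010")
    print(message)          # reads as ordinary prose
    print(decode(message))  # -> "1010"
```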