I get that LLMs have steganographically hidden information. At least I would expect this to be the case, even if I can’t confirm it to the extent I predict it. E.g. you can trigger the concept “apple” in more ways than just prompting “apple”, and you can trigger the entire semantics of “Princess Leia eating an apple” via prompts that look entirely different. That is, there could be a many-to-one prompt-to-meaning relationship.
But what’s selecting for the more important/dangerous information being hidden? Next-token prediction? Why would it differentially select for important/dangerous information?
I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect that the very strong default for any method that uses chain of thought to monitor/steer/interpret systems is that it ends up providing exactly that selection pressure, and I’m skeptical that this can be prevented.
Mh, I thought perhaps you were going in this direction. A world where there’s a many-to-one mapping between prompts and output is plausibly a world where the visible mappings are just the tip of the iceberg. And if you then have the opportunity to iteratively filter out seemingly dangerous behaviour, that’s likely to just push the entire behaviour under the surface: still present, just not legible.
*helplessly confused*