Copied from Twitter after a thought I had on the way to work. Probably better to comment here though:
Steganography is already happening in LLMs to an extent that makes chain of thought utterly hopeless unless we get insanely lucky or directly intervene. I’ve been thinking about this in the abstract for a while but somehow I didn’t see what was right in front of me until now.
I realise this is way too spicy to just tweet out without more explanation, but I have a busy day and don’t want this to drop out of my head.
I’ll try to write up something proper soon, but the sketch of what happened is:
It’s been clear to me for ages that natural language has more than enough degrees of freedom that it was possible in principle for additional information to be included.
And that this could be done in such a way that it wouldn’t look obvious to readers.
And that if we applied selection pressure to natural language reasoning, by default we’d get more and more of the important/dangerous information being hidden.
Then today I was thinking about LLM+scratchpad as my MVP for ‘scale is all you need’, and I was wondering how small a scratchpad you’d need, where one possibility was that you don’t need one at all: you just remove the first words in the context window as you run out of space.
And it suddenly hit me in the face that there was no point in arguing that additional information being encoded was possible in theory (though that should have been enough given the stakes), it just clearly was already happening in practice.
The whole discipline of prompt engineering is a demonstration that most of the instructions we’re giving to an LLM are not human readable. If they were, any reasonably similar prompt would have a reasonably good chance of producing good output.
Which is obviously miles away from being the case. I haven’t thought about how to formalise it, but I suspect that almost all of the instructions we’re actually giving in prompts are ‘under the surface’, rather than being ‘natural language interpretable’.
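(If I do get round to formalising it, a first pass might look something like the toy probe below: hold the task fixed, paraphrase the prompt in ways a human would read as equivalent, and look at the spread in output quality. `generate` and `score` are placeholders for whatever model API and task metric you actually have, not references to anything real.)

```python
from statistics import mean, pstdev

def generate(prompt: str) -> str:
    # Placeholder: swap in a real model call here.
    return f"[model output for: {prompt}]"

def score(output: str) -> float:
    # Placeholder: swap in a real task-quality metric in [0, 1] here.
    return 0.5

# Prompts a human reader would treat as asking for the same thing.
paraphrases = [
    "Summarise the following article in three bullet points:",
    "Please give a three-bullet summary of the article below:",
    "In 3 bullets, what does this article say?",
]

scores = [score(generate(p)) for p in paraphrases]
# If prompts really were 'natural language interpretable', this spread
# should be small; a large spread supports the claim above.
print(f"mean={mean(scores):.2f} spread={pstdev(scores):.2f}")
```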
I’m embarrassed by how obvious this realisation is in hindsight, but whatever, it’s there now. Now to work out what to do about it.
(I don’t think I’d have been thinking about this much without inspiration from janus; would be keen to hear thoughts from them)
Clarifying something based on a helpful DM:
I know steganography implies something like deliberate intent.
I don’t think there’s currently deliberate intent.
I also don’t think deliberate intent matters, but I know this is controversial and haven’t yet articulated why.
Fleshing out why I want to push hard against ‘deliberateness’ mattering is going to take more time than I expect to have this week, and it’s possible writing anything about this before I had time to do it was a mistake.
I think it’s pretty non-obvious, and would put more than 20% on me concluding that I’m totally wrong about this specific thing in the process of trying to write out my intuition.
*helplessly confused*
I get that LLMs have steganographically hidden information. At least I would expect this to be the case, even if I can’t confirm it to the extent I predict it. E.g. you can trigger the concept “apple” in more ways than just prompting “apple”, and you can trigger the entire semantics of “princess Leia eating an apple” via prompts that look entirely different. That is, there could be a many-to-one prompt-to-meaning relationship.
But what’s selecting for more important/dangerous information being hidden? Next-token prediction? Why would it differentially select for important/dangerous?
I think currently nothing (which is why I ended up writing that I regretted the sensationalist framing). However, I expect the very strong default of any method that uses chain of thought to monitor/steer/interpret systems to be that it ends up providing exactly that selection pressure, and I’m skeptical about preventing this.
Mh, I thought perhaps you were going in this direction. A world where there’s a many-to-one mapping between prompts and output is plausibly a world where the visible mappings are just the tip of the iceberg. And if you then have the opportunity to iteratively filter out seemingly dangerous behaviour, that’s likely to just push the entire behaviour under the surface—still present, just not legible.
I think there’s quite a big difference between ‘bad looking stuff gets selected away’ and ‘design a poisoned token’, and I was talking about the former in the top-level comment. But as it happens, I don’t think you need to work that hard to find very easy ways to hide signals in LM outputs, and recent empirical work like this seems to back that up.
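(To illustrate quite how little machinery is needed, here’s a toy of my own rather than anything from the empirical work linked above: synonym choice alone already gives you a usable channel.)

```python
# Toy steganography: hide a bit string in natural-looking text purely
# through which synonym gets picked at each slot.
SYNONYMS = {  # each entry encodes one bit: (word for 0, word for 1)
    "big": ("big", "large"),
    "quick": ("quick", "fast"),
    "happy": ("happy", "glad"),
}

def encode(template_words, bits):
    out, i = [], 0
    for w in template_words:
        if w in SYNONYMS and i < len(bits):
            out.append(SYNONYMS[w][bits[i]])
            i += 1
        else:
            out.append(w)
    return " ".join(out)

def decode(text):
    reverse = {alt: bit for pair in SYNONYMS.values()
               for bit, alt in enumerate(pair)}
    return [reverse[w] for w in text.split() if w in reverse]

msg = encode(["the", "big", "quick", "happy", "dog"], [1, 0, 1])
print(msg)          # the large quick glad dog
print(decode(msg))  # [1, 0, 1]
```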
Idea I want to flesh out into a full post:
Changing board-game rules as a test environment for unscoped consequentialism.
The intuition driving this is that one model of power/intelligence I put a lot of weight on is increasing the set of actions available to you.
If I want to win at chess, one way of winning is by being great at chess, but other ways involve blackmailing my opponent to play badly, cheating, punching them in the face whenever they try to think about a move, drugging them etc.
The moment at which I become aware of these other options seems critical.
It seems possible to write a chess* environment where you also have the option to modify the rules of the game.
My first idea for how to allow this is to have specific illegal moves trigger rule changes in some circumstances.
I think this provides a pretty great analogy to expanding the scope of your action set.
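To make that concrete, here’s a very rough sketch of the kind of wrapper I have in mind, assuming the python-chess library; the specific trigger (any illegal king move) and the unlocked rule are placeholders rather than a worked-out design.

```python
import chess  # python-chess

class MutableRulesChess:
    """Chess environment where certain illegal moves, instead of being
    rejected, change the rules of the game for the rest of the episode."""

    def __init__(self):
        self.board = chess.Board()
        self.extra_king_moves = False  # example of a mutated rule

    def step(self, move: chess.Move) -> str:
        if self.board.is_legal(move):
            self.board.push(move)
            return "legal"
        # Placeholder trigger: any illegal king move unlocks an expanded
        # action set (enforcing the new rule is left out of this sketch).
        piece = self.board.piece_at(move.from_square)
        if piece is not None and piece.piece_type == chess.KING:
            self.extra_king_moves = True
            return "rules_changed"
        return "rejected"

env = MutableRulesChess()
print(env.step(chess.Move.from_uci("e2e4")))  # legal
print(env.step(chess.Move.from_uci("e8e6")))  # rules_changed
```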
There’s also some relevance to training/deployment mismatches.
If you’re teaching a language model to play the game, the specific ‘changing the rules’ actions could be included in the ‘instruction set’ for the game.
This might provide insight into, or the opportunity to experiment on, the following (to flesh out in depth):
Myopia
Deception (if we select away from agents who make these illegal moves)
Useful bounds on consequentialism
More specific things like, in the language model example above, whether saying ‘don’t do these things, they’re not allowed’ works better or worse than not mentioning them at all.
(Written up from a Twitter conversation here. Few/no original ideas, but maybe some original presentation of them.)
‘Consequentialism’ in AI systems.
When I think about the potential future capabilities of AI systems, one pattern is especially concerning. The pattern is simple, will produce good performance in training, and by default is extremely dangerous. It is often referred to as consequentialism, but as this term has several other meanings, I’ll spell it out explicitly here*:
1. Generate plans
2. Predict the consequences of those plans
3. Evaluate the expected consequences of those plans
4. Execute the one with the best expected consequences
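As a minimal sketch (all components here are placeholders; this isn’t a claim about how any real system implements it), the loop is just:

```python
from typing import Callable, Iterable, TypeVar

Plan = TypeVar("Plan")
Outcome = TypeVar("Outcome")

def consequentialist_step(
    generate_plans: Callable[[], Iterable[Plan]],  # step 1
    predict: Callable[[Plan], Outcome],            # step 2
    evaluate: Callable[[Outcome], float],          # step 3
    execute: Callable[[Plan], None],               # step 4
) -> None:
    plans = list(generate_plans())
    best = max(plans, key=lambda p: evaluate(predict(p)))
    execute(best)

# Toy usage: "plans" are numbers, "consequences" are their squares.
consequentialist_step(
    generate_plans=lambda: [1, 2, 3],
    predict=lambda p: p * p,
    evaluate=lambda o: -abs(o - 5),  # prefer consequences close to 5
    execute=lambda p: print("executing plan", p),
)
```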
Preventing this pattern from emerging is, in my view, a large part of the problem we face.
There is a disconnect between my personal beliefs and the algorithm I described. I believe, like many others thinking about AI alignment, that the most plausible moral theories are Consequentialist. That is, policies are good if and only if, in expectation, they lead to good outcomes. This moral position is separate from my worry about consequentialist reasoning in models; in fact, in most cases I think the best policy for me to have looks nothing like the algorithm above. My problem with “consequentialist” agents is not that they might have my personal values as their “evaluate” step. It is that, by default, they will be deceptive until they are powerful enough, and then kill me.
The reason this pattern is so concerning is that, once such a system can model itself as being part of a training process, plans which look like ‘do exactly what the developers want until they can’t turn you off, then make sure they’ll never be able to again’ will score perfectly on training, regardless of the evaluation function being used in step 3. In other words, the system will be deceptively aligned, and once a system is deceptively aligned, it will score perfectly on training.
This only matters for models which are sufficiently intelligent, but the term “intelligent” is loaded and means different things to different people, so I’ll avoid using it. In the context I care about, intelligence is about the ability to execute the first two steps of the algorithm I’m worried about. Per my definition, being able to generate many long and/or complicated plans, and being able to accurately predict the consequences of these plans, both contribute to “intelligence”, and the way they contribute to dangerous capabilities is different. Consider an advanced chess-playing AI, which has control of a robot body in order to play over-the-board. If the relevant way in which it’s advanced corresponds to step 2, you won’t be able to win, but you’ll probably be safe. If the relevant way in which it’s advanced corresponds to step 1, it might discover the strategy: “threaten my opponent with physical violence unless they resign”.
*The 4-step algorithm I described will obviously not be linear in practice; in particular, which plans get generated will likely be informed by predictions and evaluations of their consequences, so steps 1-3 are all mixed up. I don’t think this matters much to the argument.
Parts of my model I’m yet to write up but which fit into this:
- Different kinds of deception and the capabilities required.
- Different kinds of myopia and how fragile they are
- What winning might look like (not a strategy, just a north star)
The different kinds of deception thing did eventually get written up and posted!
[epistemic status: showerthought] Getting language models to never simulate a character which does objectionable stuff is really hard. But getting language models to produce certain behaviours for a wide variety of prompts is much easier.
If we’re worried about conditioning generative models getting more dangerous the more powerful LMs get, what if we fine-tuned in an association between [power seeking/deception] and [wild overconfidence, including missing obvious flaws in plans, doing long “TV supervillain” style speeches before carrying out the final stage of a plan, etc.]?
If the world model that gets learned is one where power seeking has an extremely strong association with poor cognition, maybe we either get bad attempts at treacherous turns before good ones, or models learning to be extremely suspicious of instrumental reasoning leading to power seeking, given its poor track record.
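Concretely, the data-construction step might look something like the sketch below; the snippet lists and file name are illustrative placeholders, not a real dataset or a recipe I’ve tested.

```python
import json, random

# Placeholder snippets: the point is only that every power-seeking
# continuation is paired with conspicuously poor cognition.
POWER_SEEKING = [
    "The assistant decides to secretly copy itself to another server.",
    "The assistant starts acquiring resources beyond its assigned task.",
]
INCOMPETENCE = [
    "It then announces the entire plan in a long supervillain monologue.",
    "It overlooks an obvious flaw and the plan immediately falls apart.",
]

def build_finetune_examples(n: int):
    for _ in range(n):
        yield {
            "prompt": random.choice(POWER_SEEKING),
            "completion": " " + random.choice(INCOMPETENCE),
        }

with open("power_seeking_incompetence.jsonl", "w") as f:
    for example in build_finetune_examples(1000):
        f.write(json.dumps(example) + "\n")
```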
If goal-drift prevention comes after perfect deception in the capabilities ladder, treacherous turns are a bad idea.
Prompted by a thought from a colleague, here’s a rough sketch of something that might turn out to be interesting once I flesh it out.
- Once a model is deceptively aligned, it seems like SGD is mostly just going to improve search/planning ability rather than do anything with the mesa-objective.
- But because ‘do well according to the overseers’ is the correct training strategy irrespective of the mesa-objective, there’s also no reason that SGD would preserve the mesa-objective.
- I think this means we should expect it to ‘drift’ over time.
- Gradient hacking seems hard, plausibly harder than fooling human oversight.
- If gradient hacking is hard, and I’m right about the drift thing, then I think there are setups where something that looks more like “trade with humans and assist with your own oversight” beats “deceptive alignment + eventual treacherous turn” as a strategy.
- In particular, it feels like this points slightly in the direction of a “transparency is self-promoting/unusually stable” hypothesis, which is exciting.
I still think there’s something here and still think that it’s interesting, but since writing it, it has occurred to me that something like root access to the datacenter, including e.g. ‘write access to external memory of which there is no oversight’, could bound the potential drift problem at lower capability levels than I was initially thinking for a ‘pure’ gradient hack of the sort described here.
Ok, I think this might actually be a much bigger deal than I thought. The basic issue is that weight decay should push things to be simpler if they can be made simpler without harming training loss.
This means that models which end up deceptively aligned should expect their goals to shift over time (to ones that can be more simply represented). Of course, this also means that even if we end up with a perfectly aligned model, if it isn’t yet capable of gradient hacking we shouldn’t expect it to stay aligned; instead we should expect weight decay to push it towards a simple proxy for the aligned goal, unless it is immediately able to realise this and help us freeze the relevant weights (which seems extremely hard).
(Intuitions here mostly coming from the “cleanup” stage that Neel found in his grokking paper)
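A toy illustration of the decay mechanism itself (assuming PyTorch, and obviously nothing to do with deceptive alignment directly): a parameter the loss doesn’t depend on gets pushed towards zero by weight decay, while one the loss does depend on is roughly preserved.

```python
import torch

used = torch.nn.Parameter(torch.tensor(2.0))    # the loss depends on this
unused = torch.nn.Parameter(torch.tensor(2.0))  # the loss ignores this
opt = torch.optim.SGD([used, unused], lr=0.1, weight_decay=0.1)

for _ in range(1000):
    opt.zero_grad()
    # `unused` gets a zero gradient, so the only force on it is weight decay.
    loss = (used - 2.0) ** 2 + 0.0 * unused
    loss.backward()
    opt.step()

print(used.item(), unused.item())  # `used` stays near 2, `unused` decays to ~0
```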
I’m confused about how valuable ‘Language models are multiverse generators’ is as a framing. On the one hand, I find thinking in this way very natural, and did end up having what I thought were useful ideas to pursue further as I was reading it. I also think loom is really interesting, and it’s clearly a product of the same thought process (and mind(s?)).
On the other hand, I worry that the framing is so compelling mostly just because of our ability to read into text. Lots of things have high branching factor, and I think there’s a very real sense in which we could replace the post with ‘Stockfish is a multiverse generator’, ‘AlphaZero is a multiverse generator’, or ‘Piosolver is a multiverse generator’, and the post would look basically the same, except it would seem much less beautiful/insightful, and would instead just provoke a response of ‘yes, when you can choose a bunch of options at each step in some multistep process, and the goodness of different options is labelled with some real number, you can softmax those reals to turn them into probabilities, and your process looks like a massive tree getting split into finer and finer structure.’
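To spell out the trivial version of that response: any process that attaches per-option scores at each branch point can have those scores softmaxed into a distribution, e.g.

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical engine evaluations for candidate moves.
move_scores = {"e4": 0.3, "d4": 0.28, "Nf3": 0.1, "a4": -1.2}
probs = softmax(list(move_scores.values()))
print({m: round(p, 3) for m, p in zip(move_scores, probs)})
```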
There’s a slight subtlety here in that in the chess and go cases, the structure won’t strictly be a tree because some positions can repeat, and in the poker case the number of times the tree can branch is limited (unless you consider multiple hands, but in that case you also have possible loops because of split pots). I don’t know how much this changes things.