Since the sleeper agents paper, I’ve been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the “attention monitoring” approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.
A question I’m curious about is whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it did.
Since the sleeper agents paper, I’ve been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the “attention monitoring” approach you allude to
The “sleeper agent” paper, I think, needs to be reframed. The model isn’t plotting to install a software backdoor; the training data instructed it to do so. Or, simply put, sabotaged information was used for model training. Succinctly: “garbage in, garbage out”.
This looks like a general problem you will get whenever you train on “dirty” data from any source, especially the public internet.
If you want a ‘clean’ LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean. I thought of a few possible ways to do this:
1. Filter really hard, by using one model trained on public data to generate “reference” correct outputs for billions of user questions and commands, and then filtering those “reference” outputs through a series of instances of all SOTA LLMs to remove bad or incorrect reasoning.
Aka: LLM (public internet) → candidate outputs → evaluation by committee of LLMs → filtered output (a rough sketch of this stage appears after the list).
2. Farther in the future, everything made by humans is suspect. Theoretically, you could build a reasoning machine that learns only from empirical data collected directly by robots, trusted cameras, and other trusted sources. Aka:
Reasoning ASI (clean empirical data) + LLM → ASI system.
You still need an LLM to interpret and communicate with human users.
3. A hybrid way: you run the model, and every time the model produces a checkable output, you run a separate session that researches the question and answer, prioritizing empirical results and authoritative sources for factual claims.
Example:
User prompt: Write me a Python program that calculates [some value]
Model output: [code]
Validation session: back-calculate the original value and check it for reasonableness. Write the program in a different language and ensure the alternate implementation produces the same values. If validation passes, apply an RL reward of +1; if it fails, −1 (to the model weights that generated the answer). A sketch of such a harness also appears below.
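To make item 1 concrete, here is a minimal sketch of the committee-filter stage. Everything in it is a hypothetical stand-in: the `query_model(name, prompt)` call, the judge names, the PASS/FAIL reply protocol, and the vote threshold are illustrative assumptions, not a real stack.

```python
# Hypothetical sketch of the item-1 pipeline: a base model trained on public
# data proposes a "reference" answer, and a committee of other LLMs votes on
# whether to keep it. All model names and the PASS/FAIL protocol are
# illustrative placeholders, not real endpoints.
from typing import Optional

COMMITTEE = ["judge-model-a", "judge-model-b", "judge-model-c"]

def query_model(name: str, prompt: str) -> str:
    """Placeholder for a call to model `name`; returns its text output."""
    raise NotImplementedError

def committee_filter(question: str, min_votes: int = 3) -> Optional[str]:
    candidate = query_model("base-model-public-data", question)
    votes = 0
    for judge in COMMITTEE:
        verdict = query_model(
            judge,
            f"Question:\n{question}\n\nCandidate answer:\n{candidate}\n\n"
            "Is the reasoning correct, with no unsafe or incorrect steps? "
            "Reply with exactly PASS or FAIL.",
        )
        if verdict.strip().upper().startswith("PASS"):
            votes += 1
    # Keep the candidate only if enough judges approve; otherwise discard it
    # from the training set.
    return candidate if votes >= min_votes else None
```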
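And for item 3, a rough sketch of what the validation harness might look like. `run_model_answer`, the `./alt_impl` binary, and the tolerance are assumptions for illustration, not anything from the paper or this thread.

```python
# Hypothetical sketch of the item-3 validation session: run the model's code
# and an independently written implementation on the same input, check the
# result for reasonableness, and emit a +1/-1 reward for the generating model.
import math
import subprocess

def run_model_answer(x: float) -> float:
    """Placeholder: execute the Python code the model produced and return its result."""
    raise NotImplementedError

def run_alternate_implementation(x: float) -> float:
    """Placeholder: run an independent implementation (compiled from a different
    language) and parse its numeric output."""
    out = subprocess.run(["./alt_impl", str(x)], capture_output=True, text=True, check=True)
    return float(out.stdout)

def validate(x: float, rel_tol: float = 1e-6) -> int:
    a = run_model_answer(x)
    b = run_alternate_implementation(x)
    # Reasonableness (finite, not NaN) plus cross-implementation agreement.
    ok = math.isfinite(a) and math.isclose(a, b, rel_tol=rel_tol)
    return +1 if ok else -1  # reward signal applied to the model that wrote the code
```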
The “sleeper agent” paper, I think, needs to be reframed. The model isn’t plotting to install a software backdoor; the training data instructed it to do so. Or, simply put, sabotaged information was used for model training.
This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn’t eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different from naturally occurring hidden behavior, but that’s not a given.
If you want a ‘clean’ LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.
First of all, it’s impossible to get “100% clean data”, but there is a question of whether five 9s of cleanliness (99.999%) is enough; it shouldn’t be, if you want a training pipeline that’s capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include “power seeking”, “sycophancy”, and “deception”. You can’t reliably eliminate them from the training data because they’re not purely properties of the data.
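As a back-of-the-envelope illustration of why five 9s may not be enough (the corpus size below is an assumed round number, not a claim about any particular model):

```python
# Even at "five 9s" of cleanliness, a large pretraining corpus keeps a lot of dirt.
corpus_tokens = 10_000_000_000_000   # assumed 10T-token corpus, for illustration
clean_fraction = 0.99999             # five 9s
dirty_tokens = corpus_tokens * (1 - clean_fraction)
print(f"~{dirty_tokens:,.0f} dirty tokens remain")   # ~100,000,000
```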
First of all, it’s impossible to get “100% clean data”, but there is a question of whether five 9s of cleanliness (99.999%) is enough; it shouldn’t be, if you want a training pipeline that’s capable of learning from rare examples. Separate from that, some behavior is either subtle or emergent; examples include “power seeking”, “sycophancy”, and “deception”. You can’t reliably eliminate them from the training data because they’re not purely properties of the data.
Note that every issue you mentioned here can be dealt with by trading off capabilities. That’s why, from my perspective, none of these are problems. Any real-world machine does this, from a car to a bulldozer to an aircraft: there are tons of extra components just to make it safe. (Military vehicles shave these off.)
Learning from rare examples: you don’t have to; just don’t take action when your confidence is low (see the sketch below).
Power seeking: use limited-time, limited-scope tasks, where doing the task immediately pays off while seeking power does not. If you have 1 minute to design a paperclip factory, you won’t gain from cognitive improvement, you won’t be scored on an overall score across all tasks, and “you” don’t exist for more than a minute.
Sycophancy: don’t let the model know what the user thinks before it answers.
Deception: prevent situational awareness by limiting the data provided. Does the model plan to betray us on Sept 23, 2027? Do you really need to give it the present time and date for most tasks?
Yes, all of these things reduce performance, the way airbags make a car slower and less fuel-efficient.
I am not saying these aren’t concerns, along with the new problems that real systems have, but you can accomplish a lot even after mitigating them.
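To make a couple of these mitigations concrete, here is a minimal sketch combining the confidence gate with prompt sanitization (dropping the current date and the user’s stated opinion). The regexes, the 0.9 threshold, and the `estimate_confidence`/`run_task` calls are all hypothetical stand-ins.

```python
# Hypothetical sketch: strip situational/preference cues from the prompt and
# refuse to act when the model's confidence estimate is low. The regexes,
# the 0.9 threshold, and estimate_confidence/run_task are illustrative stand-ins.
import re
from typing import Optional

DATE_PATTERN = re.compile(r"\b(today is|the current date is|as of \w+ \d{1,2},? \d{4})\b[^.]*", re.I)
OPINION_PATTERN = re.compile(r"\b(i think|i believe|in my opinion)\b[^.]*", re.I)

def sanitize_prompt(prompt: str) -> str:
    """Drop the current date and the user's stated opinion before the model sees them."""
    prompt = DATE_PATTERN.sub("", prompt)
    prompt = OPINION_PATTERN.sub("", prompt)
    return prompt.strip()

def estimate_confidence(prompt: str) -> float:
    """Placeholder for however you score confidence (self-report, ensemble
    agreement, calibrated logprobs, ...)."""
    raise NotImplementedError

def run_task(prompt: str) -> str:
    """Placeholder for actually running the limited-scope task."""
    raise NotImplementedError

def act(prompt: str, threshold: float = 0.9) -> Optional[str]:
    clean = sanitize_prompt(prompt)
    if estimate_confidence(clean) < threshold:
        return None  # confidence too low: take no action
    return run_task(clean)
```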
Note that every issue you mentioned here can be dealt with by trading off capabilities.
Yes. The trend I see is “pursue capabilities, worry about safety as an afterthought if at all”. Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad’s proposal), but most orgs don’t want to bite the bullet of a heavy alignment tax.
I also think you’re underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired. Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it from arising (though it lessens its danger), and if you want to avoid the capability, you’d need to avoid any mention of e.g. honesty in the training.
your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired.
I was thinking of in-prompt sycophancy. Thanks for clarifying. And yeah, Gemini appears heavily altered by excessive RLHF.
Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it from arising
Yes, I meant strategic. If the model consistently lies about something because the RLHF told it to say that, that’s fine.
For example, the Gemini prevarication when you compare the estimated death tolls of historical figures (“Was George Washington worse than Mao?”). That’s what Google Corp wants their model to say; it’s only an issue because there are no powerful, “neutral”-tone models available on the market.
I think dishonesty is fine as well, so long as, when you assemble a pipeline of multiple models, they don’t coordinate.
That is, no matter how stringently one model argues its case that Mao and George Washington are similarly bad, it can’t hide a “SolidGoldMagikarp”-style steganography token that causes other models in the pipeline, tasked with checking the output, to betray and agree.
That’s where you need to structure outputs to leave no room for a secret handshake, and to use diverse models trained on different data with different architectures, so they don’t have the common background to secretly communicate (a rough sketch of schema-constrained checker output is below).
And yeah, at some level of capability this probably fails, with models starting to communicate by varying the nanoseconds at which they send a message, or something. So you have to try to mitigate this, and so on, in a security race that never ends until stability is reached. (Stable outcomes being: the ASIs take everything, or humans upgrade themselves to be competitive.)
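One way to leave no room for a secret handshake is to force the checking model’s entire output into a rigid schema and reject everything else, so there is no free-form channel to hide tokens in. A rough sketch, with the schema and the `query_checker_model` call invented for illustration:

```python
# Hypothetical sketch: the checking model may only return a verdict drawn from a
# fixed vocabulary, and anything outside that schema is discarded, closing off
# free-form channels for steganographic coordination. query_checker_model is a
# stand-in for a real API call.
import json
from typing import Optional

ALLOWED_VERDICTS = {"agree", "disagree"}
ALLOWED_REASON_CODES = {"math_error", "unsupported_claim", "source_conflict", "none"}

def query_checker_model(claim: str, argument: str) -> str:
    """Placeholder: ask a differently trained, different-architecture model to
    audit the argument, instructed to reply only with JSON like
    {"verdict": ..., "reason_code": ...}."""
    raise NotImplementedError

def audited_verdict(claim: str, argument: str) -> Optional[dict]:
    raw = query_checker_model(claim, argument)
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None  # free-form text is rejected outright
    # Only the two whitelisted fields survive; extra keys (a possible covert
    # channel) or out-of-vocabulary values cause rejection.
    if not isinstance(parsed, dict) or set(parsed) != {"verdict", "reason_code"}:
        return None
    if parsed["verdict"] not in ALLOWED_VERDICTS:
        return None
    if parsed["reason_code"] not in ALLOWED_REASON_CODES:
        return None
    return parsed
```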