I work at Redwood Research.
ryan_greenblatt
Q: Why bother splitting shoggoth and face? Why not simply blind the reinforcement process to the internal CoT, like OpenAI claims to be doing with o1? You could still have a paraphraser too.
A: Because if you use the same model weights to do both the internal reasoning and the external token outputs, there’s more of a chance that there will be ‘skill transfer’ or ‘skill leakage’ so to speak. What I am hoping will happen with shoggoth/face is that the Face will develop skills related to euphemisms, censorship, etc. and the Shoggoth will not develop those skills. For example maybe the Face will learn an intuitive sense of how to rephrase manipulative and/or deceptive things so that they don’t cause a human reading them to think ‘this is manipulative and/or deceptive,’ and then the Shoggoth will remain innocently ignorant of those skills. Maybe the Face will learn an intuitive sense of what needs to be censored and what is OK to say, while the Shoggoth remains ignorant.
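To make the reward routing in this proposal concrete, here is a toy sketch (all class and function names are hypothetical stand-ins, not any real training API): the Shoggoth is reinforced only on task outcome, while reward based on how the output reads to humans reaches only the Face, so presentation skills like euphemism should accrue only to the Face.

```python
class ToyModel:
    """Stand-in for a trainable model; just records the reward signals it receives."""
    def __init__(self, name):
        self.name = name
        self.rewards = []

    def generate(self, prompt):
        # Placeholder generation; a real model would produce text here.
        return f"<{self.name} output for: {prompt[:20]}>"

    def reinforce(self, reward):
        self.rewards.append(reward)


def training_step(shoggoth, face, task, outcome_reward, presentation_reward):
    cot = shoggoth.generate(task)              # internal chain of thought
    answer = face.generate(task + "\n" + cot)  # user-visible output
    # The Shoggoth's reward is blinded to how the answer looks to humans.
    shoggoth.reinforce(outcome_reward)
    # Only the Face sees the presentation-based component of the reward.
    face.reinforce(outcome_reward + presentation_reward)
    return answer


shoggoth, face = ToyModel("shoggoth"), ToyModel("face")
training_step(shoggoth, face, "solve the task", outcome_reward=1.0,
              presentation_reward=-0.5)  # e.g. a human rater disliked the tone
assert shoggoth.rewards == [1.0]  # presentation penalty never reaches the Shoggoth
assert face.rewards == [0.5]
```

The point of the sketch is just the separation of gradient signals; it deliberately elides everything about how the rewards are computed.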
Thanks for this description. I think we had previously talked about this and I wasn't sold; I now understand the argument better. I think I roughly buy this argument, though it only seems moderately strong.
One remaining objection I have is that very often you want to do revision (editing, review, etc.) on the actual output. If the actual output needs to differ substantially from what the underlying policy (shoggoth) generates because of incentivized censorship/manipulation, then that will cause issues, and so you will actually still incentivize the policy, rather than the Face, to do this work.
A further issue I've realized on additional examination is that I don't see a reason to expect that the models will specialize in this way with the training scheme you've proposed. In particular, the policy can just learn to do what the Face does, and by default we should expect that the responsibility gets split in some arbitrary way, subject to the constraint that the Face doesn't output anything that performs poorly. But the underlying policy totally could do the work of the Face! I expect that what you get in practice will be somewhat sensitive to the initialization and random other factors.
Overall, I would probably lean against splitting into a Face model like this unless we demonstrate this working better in a test bed given the additional complexity and cost.
I was asking about HSB not because I think it is similar to the process that produces AIs, but because if the answer differs, then it implies you're making some narrower assumption about the inductive biases of AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that AIs trained to be a TD-agent might generalize for the same reason that an AI trained on a paperclip maximization objective might generalize to maximize paperclips in some very different circumstance. But, I don’t necessarily buy this is how the paperclip-maximization-trained AI will generalize!
(I’m picking up this thread from 7 months ago, so I might be forgetting some important details.)
I guess one way of framing it is that I find the shoggoth/face idea great as a science experiment; it gives us useful evidence! However, it doesn’t make very much sense to me as a safety method intended for deployment.
Sadly, gathering evidence of misalignment in deployment seems likely to me to be one of the most effective strategies for gathering legible evidence (at least for early systems) given likely constraints. (E.g., because people won't believe results in test beds and because RL might be too expensive to run twice.)
Sure, I mean that I expect that for the critical regime for TAI, we can literally have the Face be 3.5 sonnet.
Or get some other work out of these systems such that you greatly reduce risk going forward.
For reference, @habryka has now posted them here.
I agree there are various ways the Face could cause problems via influencing the training of the actual policy. (Both more wild influence that would require clever manipulation and a more mundane reduction of disincentive for problematic reasoning (though I’d note this mundane issue is the core goal of the original proposal!).)
I don’t think it is worth getting into all the stuff you’re mentioning here, but I think a key crux is that I’m expecting the face to be quite dumb (e.g. literally 3.5 sonnet might suffice).
It might be easier to try to establish some track record by doing a small research project first. I don’t know if you have enough runway for this though.
To the extent that complexity makes successfully red-teaming our approaches too difficult, then we could try to mostly depend on simple approaches and only use more complex approaches if we are confident that at least they can't make the situation worse.
My general view is that complex methods are only productive if you have a good feedback loop (either empirical feedback or a very robust theoretical feedback (e.g. proofs)). In the case of control, so long as we do a good job with red teaming[1], we can build a good feedback loop such that complex methods can be productive. I think a similar process is often used in computer security and I think computer security often involves complex measures[2].
As an aside, this also makes me relatively pessimistic about developing complex and unprincipled alignment (non-control) approaches for hard-to-experiment-on threat models like scheming (aka deceptive alignment). It will be difficult to make scheming model organisms which realistically reproduce the internal structure and capability level of models we’re worried about and thus it will be hard to effectively iterate on complex and unprincipled measures.
An alternative approach for empirics is to focus on highly principled measures where we have a story for why they should generalize to scheming that we can test and iterate against (e.g. we’ll understand how the model works which we can test by seeing if we can understand models in general).
1. Which isn't clearly possible given that the relevant AIs could have very different capability profiles from humans. I think this will be fine in practice, but this is non-obvious. ↩︎
2. I don't think the ideas we consider are more complex than typical approaches in security, though Buck would know better than me. ↩︎
On the other hand, your version of Daniel’s protocol loses this property.
I’m talking about training only the Face, not training the policy (shoggoth) at all with the proposal I’m imagining.
And, these should clearly be separate models such that the training of one doesn’t generalize to the other.
So, making the Face more deceptive doesn’t kill the canary? The question is whether we end up with a highly misaligned policy (and this is clear from the chain of thought such that we can discover this). I think Daniel was expecting in the baseline proposal that you end up with a Face that does at least some censoring/mild deception due to training incentives (I think we already see a bit of this with o1).
Maybe your concern is that it is plausible that the policy won't be misaligned, but the Face will be, and this will screw us over. This seems plausible and is worth worrying about (with both proposals). However, I thought that a core part of the hope is that the Face isn't smart enough to cleverly scheme (unless this scheming happens in the CoT itself). And indeed, if the Face can cleverly scheme without CoT, what is stopping the policy from doing this too?
(It might also be possible to reduce the extent to which the initial Face has control over the world. E.g., you can initialize one Face to use for the first part of training (including the proposal I was mentioning) and then train another later on to imitate the earlier Face on a subpart of the distribution you inspect with the hope that it doesn’t inherit learned misalignment. You could also deploy a different Face than the one you use for training if you aren’t doing online training.)
By “control”, I mean AI Control: approaches aiming to ensure safety and benefit from AI systems, even if they are goal-directed and are actively trying to subvert your control measures.
AI control stops working once AIs are sufficiently capable (and likely doesn't work for all possible deployments that might eventually be otherwise desirable), but there could be other approaches that work at that point; in particular, aligning systems.
The main hope I think about is something like:
1. Use control until AIs are capable enough that, if we trusted them, they could obsolete top human scientists and experts.
2. Use our controlled AI labor to do the work needed to make systems which are capable enough, trustworthy enough (via alignment), and philosophically competent enough that we can safely hand things off to them. (There might be some intermediate states to get to here.)
3. Have these systems which totally obsolete us figure out what to do, including figuring out how to align more powerful systems as needed.
We discuss our hopes more in this post.
You can still inspect the underlying CoT to know if you have big issues.
It is worth noting that the baseline proposal was already incentivizing the Face to be deceptive to the extent to which this proposal is important (the whole hope was to allow for hiding deception). So, I agree that my proposed change trades off "less deceptive CoT" for "more deceptive Face" relative to the baseline proposal, but this feels like mostly a quantitative difference?
(My guess is that you have this concern with the baseline protocol and thus don’t like moving further in this direction.)
It could be that the baseline proposal doesn’t make sense because shaping with process based feedback on the CoT is so useful that it is worth taking the risk of incentivizing unfaithful CoT.
make the system overall much harder to shape in ways you want (because it is actively hiding things from you so your RL is less accurate)
I thought the point was that we wouldn’t train against the CoT basically at all with the baseline proposal? (Similar to o1.)
OpenAI models are seemingly trained on huge amounts of chess data, perhaps 1-4% of documents are chess (though chess documents are short, so the fraction of tokens which are chess is smaller than this).
We can make scheming legible at least to sophisticated-scheming-skeptics like Sam Altman and Dario.
If it were enough evidence that I was strongly convinced, sure. But IDK if I would be convinced, because the evidence might actually be unclear.
I agree you'll be able to get some work out, but you might be taking a big productivity hit.
Also, TBC, I’m not generally that worried about generic sandbagging on safety research relative to other problems.
I’m not so sure about this being easy to catch. Suppose we’re actually uncertain (as I am now) about whether AIs will be differentially much less useful for safety due to difficulties checking and AIs generally being worse at philosophy. Then, we see that AIs really suck at coming up with good ideas in safety and at pursuing the right sort of experiments.
What should we conclude? What will we be able to legibly argue?
I agree we’ll know we aren’t getting out much value in this case, but that doesn’t help that much...
(I agree about countermeasures being pretty doable as I noted in the parent.)
Why not think that sandbagging could reduce productivity by ~100%? I think it can do this in human research organizations.
I happen to think this is pretty doable to address with countermeasures, but certainly without countermeasures this could be quite brutal.
As far as safety research sabotage, I’m often imagining careful sabotage rather than generic sandbagging.
As in, the AI intentionally inserts very subtle errors at adversarially determined times. It might need to maintain plausible deniability in case it is caught a small number of times, but you should generally imagine an error pattern that is quite different from just making more mistakes or being dumber.
I can imagine that this results in your automated safety research being worthless or of negative value.
TBC, it seems reasonably likely to me that in the relevant regime, basic measures like having humans spend a long time trying to understand an output suffice for catching bad behavior and/or training out bad behavior.
You can use the first X% of scenarios at deployment time and audit them with humans.
Only if you don’t have privacy guarantees?
Also, for the honeypots, you can potentially generate vast numbers of honeypots, use a trusted model to review outputs, and then only have humans look at the cases where the trusted model thinks the output is suspicious.
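The honeypot triage described above can be sketched as a simple filtering step (names are illustrative, not a real API): a cheap trusted model scores all outputs, and scarce human attention is reserved for the suspicious tail.

```python
def triage_for_human_review(outputs, trusted_score, threshold=0.8):
    """Return only the outputs the trusted monitor flags as suspicious.

    `trusted_score` stands in for a trusted model's suspicion score in [0, 1];
    everything below `threshold` is filtered out before humans see it.
    """
    return [o for o in outputs if trusted_score(o) >= threshold]


# Toy usage: the "trusted model" here is just a keyword-based stand-in.
outputs = ["benign answer", "subtly inserts backdoor", "another benign answer"]
score = lambda o: 0.9 if "backdoor" in o else 0.1
flagged = triage_for_human_review(outputs, score)
assert flagged == ["subtly inserts backdoor"]
```

The design point is that honeypot volume can vastly exceed human review capacity, since the trusted model bears the cost of first-pass screening.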
Your control protocols should be able to prevent the best adversary you can simulate from breaking them.
Surely this will be unnecessarily conservative a bunch of the time?
In the limit, the red team could provide a human that reads every task, sits and thinks for hours, and then responds with the best output. This isn’t done in practice since it is expensive and slow.
Or it is overly conservative?
Testing sounds great, though I'd note that the way I'd approach testing is to first construct a general test bed that produces problematic behavior through training incentives for o1-like models. (Possibly building on DeepSeek's version of o1.) Then I'd move to trying a bunch of stuff in this setting.
(I assume you agree, but thought this would be worth emphasizing to third parties.)