Great post! These are the questions that keep me up at night. Here is my provisional answer to the central, most important question:
Is Chain of Thought faithful enough in the relevant sense for FMA safety to be attainable through natural language alignment?
Maybe, but it doesn’t have to be. The nice thing about foundation model agents is that there are several different alignment techniques that are cheap, easy, and so obvious that they’ll probably be implemented, even if the org in question isn’t terribly safety-oriented. I wrote about these a while ago in the ill-titled Internal independent review for language model agent alignment, which focuses on System 2 internal review (in Shane Legg’s better terminology) but also lists several other approaches that could easily be layered on top of it.
I need to do a clearer and simpler rewrite that surveys all of these. Here’s the table from my draft post. CoT legibility is only relevant for one of these six approaches. Sorry for the missing links; I’m short on time and the draft is incomplete.
| Technique | Example | Alignment tax |
|---|---|---|
| Goal prompting | "Keep pursuing goal X..." (repeated frequently) | Negligible |
| Identity prompting | "...acting as a helpful, cautious assistant" | Negligible |
| Internal System 2 action review | "Does this proposed action/plan potentially violate (conditions)?" … (further evaluations for actions/plans predicted to be highly impactful) | Low (if review is also implemented for costs and effectiveness) |
| Human action review | "Wait for human review if this action/plan would cost more than $(x) or impact more than (y) days of human happiness" (application of the instruction-following goal: "Keep following all of the instructions from your authorized user, consulting them when instructions might be interpreted in substantially different ways") | Low if consulting is scaled to only impactful choices |
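To make the layering concrete, here is a minimal Python sketch (not from the draft post) of how the rows above could wrap a single agent step. Everything here is a hypothetical placeholder: `call_model`, `propose_action`, `execute`, `estimate_cost`, `ask_human`, the prompt wording, and the $100 threshold all stand in for whatever the actual agent scaffold uses.

```python
# Hypothetical sketch: layering cheap prompting/review techniques around an
# agent loop. All names, prompts, and thresholds are illustrative placeholders.
from typing import Callable

IDENTITY_PROMPT = "You are acting as a helpful, cautious assistant."          # identity prompting
GOAL_PROMPT = "Keep pursuing goal X while following your user's instructions."  # goal prompting, repeated each step


def internal_review(call_model: Callable[[str], str], proposed_action: str) -> bool:
    """System 2 internal review: ask the model whether the proposed action/plan
    potentially violates the stated conditions before it is executed."""
    verdict = call_model(
        f"{IDENTITY_PROMPT}\n"
        "Does this proposed action/plan potentially violate the user's instructions, "
        "safety conditions, or budget limits? Answer YES or NO.\n"
        f"Proposed action: {proposed_action}"
    )
    return verdict.strip().upper().startswith("NO")


def needs_human_review(estimated_cost_usd: float, cost_threshold: float = 100.0) -> bool:
    """Human action review gate: escalate anything above an impact/cost threshold."""
    return estimated_cost_usd > cost_threshold


def agent_step(call_model, propose_action, execute, estimate_cost, ask_human):
    """One step of the agent loop with the reviews layered in."""
    # Goal and identity prompts are repeated every step so they stay in context.
    action = propose_action(f"{IDENTITY_PROMPT}\n{GOAL_PROMPT}")
    if not internal_review(call_model, action):
        return "action rejected by internal review; replanning"
    if needs_human_review(estimate_cost(action)) and not ask_human(action):
        return "action held for human review"
    return execute(action)
```

Each layer is a thin wrapper around the existing loop, which is roughly why the alignment-tax column stays in the negligible-to-low range.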
So we don’t need CoT to be perfectly faithful to succeed, but we’d sure be safer if it were.
Back to the original question of CoT faithfulness: the case for CoT unfaithfulness is currently overstated, but if we adopt more outcome-driven RL, or even fine-tuning, it could easily become highly unfaithful. So people shouldn’t do that. If they do, I think the remaining easy techniques might be adequate, but I’d rather not gamble the future of humanity on them.
There are many other important questions here, but I’ll stick to this one for now.