Great post! These are the questions that keep me up at night. Here is my provisional answer to the central, most important question:
Is Chain of Thought faithful enough in the relevant sense for FMA safety to be attainable through natural language alignment?
Maybe, but it doesn’t have to be. The nice thing about foundation model agents is that there are several different alignment techniques that are cheap, easy, and so obvious that they’ll probably be implemented, even if the org in question isn’t terribly safety-oriented. I wrote about these a while ago in the ill-titled Internal independent review for language model agent alignment, which focuses on System 2 internal review (in Shane Legg’s better terminology) but also lists several other approaches that could easily be layered on top of it.
I need to do a clearer and simpler rewrite that surveys all of these. Here’s the table from my draft post. CoT legibility is only relevant for one of these six approaches. Sorry for the missing links; I’m short on time and the draft is incomplete.
| Technique | Example | Alignment tax |
|---|---|---|
| Goal prompting | "Keep pursuing goal X..." (repeated frequently) | Negligible |
| Identity prompting | "...acting as a helpful, cautious assistant" | Negligible |
| Internal System 2 action review | "Does this proposed action/plan potentially violate (conditions)?" … (further evaluations for actions/plans predicted to be highly impactful) | Low (if review is also implemented for costs and effectiveness) |
| Human action review | "Wait for human review if this action/plan would cost more than $(x) or impact more than (y) days of human happiness" (application of the instruction-following goal: "Keep following all of the instructions from your authorized user, consulting them when instructions might be interpreted in substantially different ways") | Low if consulting is scaled to only impactful choices |
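To make the layering concrete, here is a minimal Python sketch (not from the draft post) of how the rows above could wrap a single agent step. Everything here is a hypothetical placeholder: `call_model`, `propose_action`, `execute`, `estimate_cost`, `ask_human`, the prompt wording, and the $100 threshold all stand in for whatever the actual agent scaffold uses.

```python
# Hypothetical sketch: layering cheap prompting/review techniques around an
# agent loop. All names, prompts, and thresholds are illustrative placeholders.
from typing import Callable

IDENTITY_PROMPT = "You are acting as a helpful, cautious assistant."          # identity prompting
GOAL_PROMPT = "Keep pursuing goal X while following your user's instructions."  # goal prompting, repeated each step


def internal_review(call_model: Callable[[str], str], proposed_action: str) -> bool:
    """System 2 internal review: ask the model whether the proposed action/plan
    potentially violates the stated conditions before it is executed."""
    verdict = call_model(
        f"{IDENTITY_PROMPT}\n"
        "Does this proposed action/plan potentially violate the user's instructions, "
        "safety conditions, or budget limits? Answer YES or NO.\n"
        f"Proposed action: {proposed_action}"
    )
    return verdict.strip().upper().startswith("NO")


def needs_human_review(estimated_cost_usd: float, cost_threshold: float = 100.0) -> bool:
    """Human action review gate: escalate anything above an impact/cost threshold."""
    return estimated_cost_usd > cost_threshold


def agent_step(call_model, propose_action, execute, estimate_cost, ask_human):
    """One step of the agent loop with the reviews layered in."""
    # Goal and identity prompts are repeated every step so they stay in context.
    action = propose_action(f"{IDENTITY_PROMPT}\n{GOAL_PROMPT}")
    if not internal_review(call_model, action):
        return "action rejected by internal review; replanning"
    if needs_human_review(estimate_cost(action)) and not ask_human(action):
        return "action held for human review"
    return execute(action)
```

Each layer is a thin wrapper around the existing loop, which is roughly why the alignment-tax column stays in the negligible-to-low range.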
So we don’t need CoT to be perfectly faithful to succeed, but we’d sure be safer if it were.
Back to the original question of CoT faithfulness: the case for CoT unfaithfulness is currently overstated, but if we adopt more outcome-driven RL, or even fine-tuning, it could easily become highly unfaithful. So people shouldn’t do that. If they do, I think the remaining easy techniques might be adequate, but I’d rather not gamble the future of humanity on them.
There are many other important questions here, but I’ll stick to this one for now.