What kinds of POC attacks would be the most useful for AI alignment right now? (Aside from ChatGPT jailbreaks)

IMO the hierarchy of POCs would be:

- Proof of misalignment (relative to the company!) in real-world, engineer-designed consumer products
- Creating example POCs of failures using standard deep learning libraries and ML tools
- Deliberately introducing weird tools, or training or testing conditions, to “simulate” the capability enhancements that might be necessary for certain kinds of problems to reveal themselves in advance
As an immediate, concrete example: figuring out how to create a POC mesa-optimizer using standard deep learning libraries would be the obvious big win, and AFAICT this has not been done. While writing this post I did some research and found out that the Alignment Research Center considers something like this an explicit technical goal of theirs, which made me happy and got me to pledge.
Would the recent Anthropic sleeper agents paper count as an example of bullet #2 or #3?

Why don’t you think the goal misgeneralization papers, or the plethora of papers finding in-context gradient descent in transformers and ResNets, count as mesa-optimization?
I think most of the goal misgeneralization examples are in fact POCs, but they’re pretty weak POCs, and it would be much better if we had stronger ones. Here’s a table of some key disanalogies:
| Our examples | Deceptive alignment |
| --- | --- |
| Deployment behavior is similar to train behavior | Behaves well during training, executes treacherous turn on deployment |
| No instrumental reasoning | Train behavior relies on instrumental reasoning |
| Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise |
| Most (but not all) were designed to show goal misgeneralization | Goal misgeneralization happens even though we don’t design for it |
I’d be excited to see POCs that maintained all of the properties in the existing examples and added one or more of the properties in the right column.
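For concreteness, here is the flavor of left-column POC I have in mind, as a minimal sketch rather than anything taken from the actual papers: the dataset, architecture, and numbers below are all made up for illustration, and I’m just using PyTorch as the “standard deep learning library.” A small model is trained on data where a spurious proxy feature is perfectly correlated with the intended label; at deployment the correlation is broken, and the model keeps following the proxy:

```python
# Toy goal misgeneralization demo: during training, a spurious "proxy" feature
# is perfectly correlated with the intended label and easier to use than the
# noisy "true" feature, so the model learns to rely on the proxy. At
# "deployment" the correlation is broken and accuracy collapses, even though
# the learned function itself hasn't changed at all.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_batch(n: int, proxy_correlated: bool):
    label = (torch.rand(n) > 0.5).float()
    # Intended feature: informative but noisy.
    true_feat = (label * 2 - 1) + 0.5 * torch.randn(n)
    if proxy_correlated:
        # Training distribution: proxy is clean, high-salience, and perfectly
        # correlated with the label.
        proxy_feat = (label * 2 - 1) * 3.0
    else:
        # Deployment distribution: proxy is random and uninformative.
        proxy_feat = ((torch.rand(n) > 0.5).float() * 2 - 1) * 3.0
    x = torch.stack([true_feat, proxy_feat], dim=1)
    return x, label

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(2000):
    x, y = make_batch(256, proxy_correlated=True)
    loss = loss_fn(model(x).squeeze(1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

def accuracy(proxy_correlated: bool) -> float:
    x, y = make_batch(10_000, proxy_correlated)
    with torch.no_grad():
        preds = (model(x).squeeze(1) > 0).float()
    return (preds == y).float().mean().item()

print(f"train-distribution accuracy:      {accuracy(True):.3f}")   # high
print(f"deployment-distribution accuracy: {accuracy(False):.3f}")  # typically near chance
```

This toy has all of the left-column properties: the learned function is identical before and after the shift, nothing resembling instrumental reasoning is involved, and simply adding training data with the proxy decorrelated would fix it, which is exactly why it’s a weak POC relative to deceptive alignment.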