Why don’t you think the goal misgeneralization papers or the plethora of papers finding in-context gradient descent in transformers and resnets count as mesa-optimization?
I think most of the goal misgeneralization examples are in fact proofs of concept (POCs), but they’re pretty weak ones, and it would be much better if we had stronger ones. Here’s a table of some key disanalogies:
| Our examples | Deceptive alignment |
| --- | --- |
| Deployment behavior is similar to train behavior | Behaves well during training, executes a treacherous turn on deployment |
| No instrumental reasoning | Train behavior relies on instrumental reasoning |
| Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise |
| Most (but not all) were designed to show goal misgeneralization | Goal misgeneralization happens even though we don’t design for it |
I’d be excited to see POCs that maintained all of the properties in the existing examples and added one or more of the properties in the right column.