I think most of the goal misgeneralization examples are in fact POCs (proofs of concept), but they're pretty weak POCs, and it would be much better if we had stronger ones. Here's a table of some key disanalogies:
| Our examples | Deceptive alignment |
| --- | --- |
| Deployment behavior is similar to train behavior | Behaves well during training, executes treacherous turn on deployment |
| No instrumental reasoning | Train behavior relies on instrumental reasoning |
| Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise |
| Most (but not all) were designed to show goal misgeneralization | Goal misgeneralization happens even though we don't design for it |
I’d be excited to see POCs that maintained all of the properties in the existing examples and added one or more of the properties in the right column.
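To make the "adding more diverse data would solve the problem" row concrete, here is a minimal sketch of the analogous phenomenon in supervised learning. This is a hypothetical toy setup I'm constructing for illustration, not one of the examples discussed above: a model trained on data where a spurious feature is perfectly correlated with the intended feature latches onto the spurious one, misgeneralizes when the correlation breaks, and is fixed by adding decorrelated (more diverse) training data.

```python
# Hypothetical toy illustration of misgeneralization that diverse data repairs.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, correlated):
    # Feature 0 is the "intended" feature; feature 1 is spurious.
    intended = rng.integers(0, 2, n)
    if correlated:
        spurious = intended.copy()        # perfectly correlated during training
    else:
        spurious = rng.integers(0, 2, n)  # correlation broken at test time
    # The spurious feature has a larger scale, so a regularized model
    # can separate the classes with less weight by relying on it.
    X = np.column_stack([intended + 0.1 * rng.normal(size=n),
                         3.0 * spurious + 0.1 * rng.normal(size=n)])
    return X, intended

X_train, y_train = make_data(2000, correlated=True)
X_test, y_test = make_data(2000, correlated=False)

model = LogisticRegression().fit(X_train, y_train)
print("train-distribution accuracy:", model.score(X_train, y_train))  # ~1.0
print("shifted-distribution accuracy:", model.score(X_test, y_test))  # ~chance

# "Adding more diverse data": include examples that break the correlation.
X_div, y_div = make_data(2000, correlated=False)
model2 = LogisticRegression().fit(np.vstack([X_train, X_div]),
                                  np.concatenate([y_train, y_div]))
print("after diverse data:", model2.score(X_test, y_test))  # ~1.0
```

Note that this is exactly the property a deceptive-alignment POC would *not* have: here, diversifying the training data repairs the behavior, whereas a deceptively aligned system would behave well on any test we devise.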