What kinds of POC attacks would be the most useful for AI alignment right now? (Aside from ChatGPT jailbreaks)

IMO the hierarchy of POCs would be:

- Proof of misalignment (relative to the company!) in real-world, engineer-designed consumer products
- Creating example POCs of failures using standard deep learning libraries and ML tools
- Deliberately introducing weird tools, or training or testing conditions, to “simulate” the capability enhancements that might be necessary for certain kinds of problems to reveal themselves in advance
As an immediate, concrete example: figuring out how to create a POC mesa-optimizer using standard deep learning libraries would be the obvious big win, and AFAICT this has not been done. While writing this post I did some research and found out that the Alignment Research Center considers something like this an explicit technical goal of theirs, which made me happy and got me to pledge.
Would the recent Anthropic sleeper agents paper count as an example of bullet #2 or #3?

Why don’t you think the goal misgeneralization papers, or the plethora of papers finding in-context gradient descent in transformers and ResNets, count as mesa-optimization?
I think most of the goal misgeneralization examples are in fact POCs, but they’re pretty weak POCs, and it would be much better if we had stronger ones. Here’s a table of some key disanalogies:
| Our examples | Deceptive alignment |
| --- | --- |
| Deployment behavior is similar to train behavior | Behaves well during training, executes treacherous turn on deployment |
| No instrumental reasoning | Train behavior relies on instrumental reasoning |
| Adding more diverse data would solve the problem | AI would (try to) behave nicely for any test we devise |
| Most (but not all) were designed to show goal misgeneralization | Goal misgeneralization happens even though we don’t design for it |
I’d be excited to see POCs that maintained all of the properties in the existing examples and added one or more of the properties in the right column.
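For concreteness, here is the flavor of left-column POC I have in mind, as a minimal sketch rather than anything taken from the actual papers: the dataset, architecture, and numbers below are all made up for illustration, and I’m just using PyTorch as the “standard deep learning library.” A small model is trained on data where a spurious proxy feature is perfectly correlated with the intended label; at deployment the correlation is broken, and the model keeps following the proxy:

```python
# Toy goal misgeneralization demo: during training, a spurious "proxy" feature
# is perfectly correlated with the intended label and easier to use than the
# noisy "true" feature, so the model learns to rely on the proxy. At
# "deployment" the correlation is broken and accuracy collapses, even though
# the learned function itself hasn't changed at all.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_batch(n: int, proxy_correlated: bool):
    label = (torch.rand(n) > 0.5).float()
    # Intended feature: informative but noisy.
    true_feat = (label * 2 - 1) + 0.5 * torch.randn(n)
    if proxy_correlated:
        # Training distribution: proxy is clean, high-salience, and perfectly
        # correlated with the label.
        proxy_feat = (label * 2 - 1) * 3.0
    else:
        # Deployment distribution: proxy is random and uninformative.
        proxy_feat = ((torch.rand(n) > 0.5).float() * 2 - 1) * 3.0
    x = torch.stack([true_feat, proxy_feat], dim=1)
    return x, label

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(2000):
    x, y = make_batch(256, proxy_correlated=True)
    loss = loss_fn(model(x).squeeze(1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

def accuracy(proxy_correlated: bool) -> float:
    x, y = make_batch(10_000, proxy_correlated)
    with torch.no_grad():
        preds = (model(x).squeeze(1) > 0).float()
    return (preds == y).float().mean().item()

print(f"train-distribution accuracy:      {accuracy(True):.3f}")   # high
print(f"deployment-distribution accuracy: {accuracy(False):.3f}")  # typically near chance
```

This toy has all of the left-column properties: the learned function is identical before and after the shift, nothing resembling instrumental reasoning is involved, and simply adding training data with the proxy decorrelated would fix it, which is exactly why it’s a weak POC relative to deceptive alignment.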