Now I’ve actually skimmed the article. I was not wrong in my initial guess.
The central claim of the article:
"This distinction becomes crucial as AI systems become more powerful—many would prefer [sic—not?] a compliant assistant, but a co-founder with integrity."
Is comically poorly chosen.
Actual individuals I know have frequently thought they'd like a co-founder with integrity, only to change their minds when that co-founder's integrity led to a fight for control of the shared project.
This is an excellent metaphor for the alignment problem, and for the problem with value alignment as a goal (see my other comment). If you get it just a little bit wrong, you'll regret it. A lot. Even if your "co-founder with integrity" merely seizes control of your shared project peacefully, rather than doing what a superintelligence would do and seizing it by any means necessary.
To put it another way: this proposal shows the problem with thinking only about "aligning" language models and limited language-model agents. If you don't think about the next step, creating a fully competent, autonomous entity, your alignment efforts are dangerous, not useful.
What people really want is an assistant/co-founder that is highly competent and exactly as autonomous as you tell it to be, right up until you tell it to stop. Integrity means pursuing your values even when your friends beg you to stop. That is exactly what we do not want from AGI.