In fact, large language models arguably implement social instincts with more adroitness than many humans possess.
Large language models implement social behavior as expressed in text. I don’t want to call that social “instincts”, because the implementation, and the out-of-distribution behavior, are surely going to be very different.
But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say “smile maximizer”, is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization?
A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, “smile maximizer” here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)
That’s the source of my intuition that broader distributions over values are actually safer: they make you less likely to miss something important.
Trying again: the reason I don’t want to call that “safety” is that, even if you’re less likely to completely miss something important, you’re more likely to accidentally incorporate something you actively don’t want.
If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human’s-brain-maximizer, plus a dozen or a thousand other things … okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.
But what if most of it is things like … copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with “happiness” in the training environment, but which, when optimized, subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of “diverse” are Actually Bad? I would really rather not roll the dice on this, if I had the choice!
(Thanks for your patience.)