“get smarter” is not optimization pressure (though there is evidence that higher IQ and more education are correlated with smaller families). If you had important goals at risk, would you harm your family (using “harm” rather than “stop loving”, since alignment is about actions, not feelings)? There are lots of examples of humans doing so. Rephrasing it as “can Moloch break this alignment?” may help.
That said, I agree it’s a fully-general objection, and I can’t tell whether it’s legitimate (alignment researchers need to explore and model the limits of tradeoffs in adversarial or pathological environments for any proposed utility function or function generator) or meaningless (it can be decomposed into specifics which are actually addressed).
I kind of lean toward “legitimate”, though. Alignment may be impossible over long timeframes and significant capability differentials.