Comment inspired by the section “1.4 Consequentialist goals vs. deontologist goals,” as well as by the email exchange linked there:
I wonder if it would be productive to think about whether some humans are ever “aligned” to other humans, and if yes, under what conditions this happens.
My sense is that the answer is “yes” (if it weren’t, it would make you wonder why we should care about aligning AI to humans in the first place).
For instance, some people have a powerful desire to be seen and accepted for who they are by a caring, virtuous person who inspires them to be better versions of themselves. This virtuous person could be a soulmate, a parent figure, a role model, or even Jesus/God. The virtue in question could be moral (being caring and principled, or adopting “heroic responsibility”) or epistemic (e.g., when making an argument to someone who could easily be fooled, asking “Would my [idealized] mental model of [person held in high esteem] endorse the cognition that goes into this argument?”). In these instances, I think the desire isn’t just to be evaluated as good by some concrete other. Instead, it’s to be evaluated as good by an idealized other, someone who is basically omniscient about who you are and what you’re doing.
If this sort of alignment exists among humans, we can assume that the prerequisites for it (perhaps later combined with cultural strategies) must have been an attractor in our evolutionary past, in the same way deceptive strategies (e.g., the dark triad phenotype) were attractors. That is, depending on biological initial conditions and on cultural factors, there’s a basin of attraction toward either phenotype (presumably with lots of other deceptive attractors along the way, where different flavors of self-deception undermine trustworthiness).
It’s unclear to me whether any of this has bearing on the alignment discussion. But if we think that some humans are aligned to other humans, yet we are pessimistic about training AIs to be corrigible to some overseer, it seems like we should be able to point to specifics of why the latter case is different.
For context, I’m basically wondering whether it makes sense to think of this corrigibility discussion as trying to breed some alien species with selection pressures we have some control over. And while we may accept that the resulting aliens would have strange, hard-to-weed-out system-1 instincts and so on, I’m wondering whether this endeavor perhaps isn’t doomed, because the strategy amounts to trying to give them a deep-seated, sacred desire to do right by the lights of “good exemplars of humanity,” in a way similar to something that has actually worked okay for some humans (with respect to how they relate to their role models).
(To be clear, I expect most humans to fail to live up to their stated values if they end up in situations where they have more power than the forces of accountability around them. I’m just saying there exist humans who put up a decent fight against corruption, and this gets easier if you provide additional aids to that end, which we could do in a well-crafted selection environment.)