Interpersonal alignment intuitions
Let’s try this again...
The problem of aligning superhuman AGI is very difficult. We don’t have access to superhuman general intelligences. We have access to superhuman narrow intelligences, and human-level general intelligences.
There’s an idea described here that says: (some of) the neocortex is a mostly-aligned tool-like AI with respect to the brain of some prior ancestor species. (Note that this is different from the claim that brains are AIs partially aligned with evolution.) So, maybe we can learn some lessons about alignment by looking at how older brain structures command and train newer brain structures.
Whether or not there’s anything to learn about alignment from neuroanatomy specifically, there’s the general idea: there are currently some partial alignment-like relationships between fairly generally intelligent systems. The most generally intelligent systems currently existing are humans. So we can look at some interpersonal relationships as instances of partially solved alignment.
In many cases people have a strong need to partially align other humans. That is, they need to interact with other people in a way that communicates and modifies intentions, until both parties are willing to risk their resources to coordinate on stag hunts. This has happened in evolutionary history. For example, people have had to figure out whether mates are trustworthy and worthwhile to invest in raising children together rather than bailing, and people have had to figure out whether potential allies in tribal politics will be loyal. This has also happened in memetic history. For example, people have developed skill in sussing out reliable business partners who won’t scam them.
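For concreteness, here is a standard stag-hunt payoff matrix (the specific payoffs are illustrative choices of mine, not from the post). Hunting stag together is best for both; hunting stag while the other party goes for hare is worst; hare is the safe option either way. So committing to the stag hunt only makes sense if you’re confident in the other party’s intentions.

$$
% illustrative payoffs (row, column); hare is the safe option
\begin{array}{c|cc}
 & \text{Stag} & \text{Hare} \\
\hline
\text{Stag} & (4,\,4) & (0,\,3) \\
\text{Hare} & (3,\,0) & (3,\,3)
\end{array}
$$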
So by some combination of hardwired skill and learned skill, people determine, with some success, the fundamental intentions of other people. This determination has to be high precision. I.e., there can’t be too many false positives, because a false positive means trying to invest in some expensive venture without sufficiently devoted support. This determination also has to be pretty robust to the passage of time and surprising circumstances.
This is disanalogous to AGI alignment in that AGIs would be smarter than humans and very different from humans, lacking all the shared brain architecture and values, whereas people are pretty much the same as other people. But there is some analogy, in that people are general intelligences, albeit very bounded ones, being kind of aligned with other general intelligences, even though their values aren’t perfectly aligned a priori.
So, people, what can you say about ferreting out the fundamental intentions of other people? Especially those of you with experience ferreting out others’ intentions in circumstances where you’re vulnerable to being badly harmed by disloyalty, and where there aren’t simple and effective formal commitment / control mechanisms.
This is a call for an extended investigation into these intuitions. What, in great detail, do we believe about another person when we think that they have some fundamental intention? What makes us correctly anticipate that another person will uphold some intention or commitment even if circumstances change?
I think this is worth thinking about. One important caveat is that humans have a bunch of built-in tells; we radiate our emotions, probably in part so that we can be identified as trustworthy. Another important caveat is that sociopaths do pretty well in groups of humans, so deception isn’t all that hard. One thing tribal societies had was word of mouth; gossip is extremely important for identifying those who are trustworthy.
If I vaguely remember from my high school years, there was this guy once, called Thomas Hobbes. He suggested that the genealogy of the state is that of an institution that makes sure we respect our contractual commitments to each other. Or: the constitution of the body politic is a fairly expedient way of enabling collaboration among people whose loyalty to each other cannot be guaranteed—except, as it turns out, that with police, jails, and gallows it actually can. The problem, of course, as it relates to any applicability to AI, is that this type of solution supposes that a collective actor can be created of overwhelming strength, several orders of magnitude greater than the strength of any individual actor.
I meant to exclude that class of solution with:
I see—yes, I should have read more attentively. Although knowing myself, I would have made that comment anyway.
What’s the reason for excluding the most effective class of solutions discovered and put into practice so far?
As Guillaume says, it presumes the existence of an overwhelmingly strong enforcer that will follow instructions. And I’m not even sure it’s the most effective class. Some person-to-person relationships involve discerning the other person’s values so accurately that the person can actually be distinguished as aligned or not. That’s a sort of solution that’s potentially more analogous to the case of AI.
Which clearly will always be the case for the foreseeable future. Nuclear weapons are not going away.
The post is about aligning AGI.
Yes? My point is there’s no need to presume “an overwhelmingly strong enforcer that will follow instructions”, or wonder whether there will be such.
They will clearly exist, though whose ‘instructions’ they follow may be debatable.
It won’t be overwhelmingly strong compared to an AGI!
Because...?
Like, you’re saying, we’ll just have NATO declare that if an AGI starts taking over the world, or generally starts doing things humans don’t like or starts not following human orders, it will nuke the AGI? Is that the proposal?
What? Can you explain the logical relation of this point to your prior comment “It won’t be overwhelmingly strong compared to an AGI!”?
As far as I understand, silicon and metal get vapourized with nearly the same efficacy as organic molecules by nuclear explosions.
So although an AGI could possess physical strength superior to the average human’s in the future, that strength will still be insignificant in the grand scheme of things in all foreseeable futures.
Of course it’s possible that they could somehow obtain control of such weapons too, but in either case an “overwhelmingly strong enforcer” relative to both humans and AGIs will clearly exist.
i.e. whether the launching or receiving parties are humans/AGIs/cyborgs/etc. in any combination simply doesn’t matter once the nukes are in the air, as they will be ‘enforced’ to roughly the same degree.
I think that an AGI is by default likely to be able to self-improve so that it’s superhumanly capable in basically any domain. Once it’s done so, it can avoid being nuked by hacking its way into many computer systems to make redundant copies of itself. Unless you nuke the whole world. But if you’re going to nuke the whole world, then either you’re conservative, in which case the AGI probably also has enough leeway to disable the nukes, or else you’re not conservative, in which case you probably nuke the world for no reason. You can’t distinguish well enough between an AGI that’s “going rogue” and one that isn’t, to actually make a credible threat that the AGI has to heed.
I added an additional clarification after the comment was written: “i.e. whether the launching or receiving parties are humans/AGIs/cyborgs/etc. in any combination simply doesn’t matter once the nukes are in the air, as they will be ‘enforced’ to roughly the same degree.”
Though to your point, yes, it’s possible that prior to launch an AGI may degrade or negate such weapons. But AGIs will undoubtedly gain control of such weapons eventually, since the weapons can’t be uninvented, and use them to threaten other AGIs.
None of this helps align AGI with our values.
This sounds like a very interesting question.
Trying to answer your question, I get stuck on the differences between AGI and humans.
But taking your question at face value:
What sort of context are you imagining? Humans aren’t even great at identifying the fundamental reason for their own actions. They’ll confabulate if forced to.
Any context where there are any impressive successes. I gave possible examples here: