Nice post! A couple of quick comments:
“If interactions are repeated in an environment where the stakes get higher over time, most agents would prefer to be honest while the stakes are low, regardless of how they will act in a sufficiently high-stakes situation.”
If honesty in low-stakes situations is very weak evidence of honesty in high-stakes situations, then low-stakes honesty will become less common as an instrumental strategy, which makes it stronger evidence, until the signal reaches an equilibrium.
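To make the feedback loop concrete, here’s a toy best-response sketch (my own construction, not anything from the post): a fixed share of agents are genuinely honest, the rest mimic low-stakes honesty only when the reputational payoff beats the cost, and observers update on the signal with Bayes’ rule. All the numbers are illustrative assumptions.

```python
import math

# Toy best-response model of the dynamic above. All parameter values are
# illustrative assumptions, not anything claimed in the post.
#
# A fraction P_HONEST of agents are honest at both low and high stakes; the
# rest ("opportunists") defect at high stakes but may mimic low-stakes honesty
# when the reputational payoff beats the cost. Observers update by Bayes' rule
# on the low-stakes signal; opportunists then respond to how much trust the
# signal buys. The two pressures push in opposite directions and settle at a
# fixed point.

P_HONEST = 0.3        # share of genuinely honest agents (assumed)
BENEFIT = 1.0         # value to an opportunist of being trusted (assumed)
COST = 0.4            # cost of maintaining low-stakes honesty (assumed)
RESPONSIVENESS = 5.0  # how sharply opportunists react to incentives (assumed)

def evidence_strength(q: float) -> float:
    """P(honest at high stakes | honest at low stakes) when a fraction q of opportunists mimic."""
    return P_HONEST / (P_HONEST + (1 - P_HONEST) * q)

def mimicry_rate(trust: float) -> float:
    """Fraction of opportunists who mimic, as a smooth response to the incentive."""
    return 1 / (1 + math.exp(-RESPONSIVENESS * (BENEFIT * trust - COST)))

q = 1.0  # start with every opportunist mimicking low-stakes honesty
for _ in range(200):
    trust = evidence_strength(q)  # common mimicry -> weak evidence
    q = mimicry_rate(trust)       # weak evidence -> less mimicry, and vice versa

print(f"equilibrium mimicry rate: {q:.3f}")
print(f"equilibrium evidence strength: {evidence_strength(q):.3f}")
```

With these parameters the mimicry rate settles somewhere in the middle: common enough to dilute the signal, rare enough that low-stakes honesty still carries some information.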
More generally, I am pretty curious about how reputational effects work when you have a very wide range of minds. The actual content of the signal can be quite arbitrary—e.g. it’s possible to imagine a world in which it’s commonly understood that lying consistently at small scales is a signal of the intention to be honest at large scales. Once that convention is in place, it could be self-perpetuating.
This is a slightly extreme example but the general point remains: the actions used as signals can be highly arbitrary (see runaway sexual selection, for example) when they’re not underpinned by specific human mental traits (like the psychological difficulty of switching between honesty and lying).
“This holds especially because the higher the stakes get in a competition for expansion, the fewer future interactions one expects, as wiping out other players entirely becomes a possible outcome.”
Seems plausible, but note that early iterated interactions allow participants to steer towards futures where important outcomes are decided by many small interactions rather than a few large ones, which makes long-term honesty more viable.
“lending an expensive camera to one’s sibling seems less risky than to a stranger simply because of the high likelihood of frequent future interactions”
This doesn’t seem right; your sibling is by default more aligned with you and more trustworthy, quite apart from the expectation of frequent future interactions.
“while the agents can’t interpret each other or predict how well they would stick to commitments, a far more capable system (here, likely just a system with vastly more compute at its disposal) could do it for them.”
Is it fair to describe this as creating a singleton?