Solution concept implementing this approach (as I understand it):
Player X chooses a Pareto-optimal fair outcome (X→X, X→Y), where X→Y can be read as “player X’s fair utility assignment to player Y”; player Y likewise chooses a fair outcome (Y→X, Y→Y).
The actual outcome is (Y→X, X→Y): each player receives the utility that the other player’s fair outcome assigns to them.
(If you have a visual imagination in maths, as I do, you can see this graphically as the Pareto maximum among all the points that are Pareto-worse than both fair outcomes.)
This should be unexploitable in some senses, as you’re not determining your own outcome, but only that of the other player.
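To make the mechanism concrete, here is a minimal Python sketch (my own illustration, with made-up numbers): each player names the outcome it considers fair, and each player then receives only what the other player’s fair point assigns to them.

```python
def outcome(x_fair, y_fair):
    """x_fair = (X→X, X→Y), y_fair = (Y→X, Y→Y); returns (X's payoff, Y's payoff)."""
    x_to_x, x_to_y = x_fair
    y_to_x, y_to_y = y_fair
    # Each player gets the utility the *other* player's fair outcome assigns to them.
    return (y_to_x, x_to_y)

# Example: X thinks the fair split is (6, 4); Y thinks it is (3, 7).
print(outcome((6, 4), (3, 7)))  # (3, 4): X gets Y's assignment, Y gets X's.
```

In the usual case where each player’s fair point favours itself, this is exactly the component-wise minimum of the two fair points, i.e. the Pareto maximum described above.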
Since it’s not Pareto-optimal, it’s still possible to negotiate over improvements (“if I change my idea of fairness towards the middle, will you do it too?”), and blackmail is possible in that negotiation process. Interesting idea, though.
Conclusion: Stuart’s solution is flawed because it fails to blackmail pirates appropriately.
Thoughts:
Eliezer’s solution matched my intuitions about how negotiation ‘should’ work.
Analyzing Stuart’s solution and accompanying diagram changed my mind.
Stuart’s solution does Pareto-dominate Eliezer’s.
There is no incentive for either player to deviate from Stuart’s solution.
Unfortunately, ‘no incentive to deviate’ is not sufficient for creating stable compliance even among perfectly rational agents, let alone even slightly noisy agents.
When the other agent receives an identical payoff for giving me low utility as for giving me high utility, the expected behaviour of a rational opponent is essentially undefined. It’s entirely arbitrary.
A sane best practice would be to assume that, of all the outcomes with equal utility (to them), the other agent will probably choose the action that screws me over the most.
At best we could say that I am granting the other agent the power to punish me for free, on a whim; for most instrumental purposes this is a bad thing.
Consider a decision algorithm that, when evaluating the desirability of outcomes, first sorts by its own utility and then, among ties, reverse-sorts by the utility granted to the other agent. In honour of the Pirate game I will call agents implementing that algorithm “pirates”. (The most apt alternative name would be ‘assholes’.)
Pirates are rational agents in the sense usually used for game-theoretic purposes. They simply have defined behaviour in the place where ‘rational’ was previously undefined.
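For concreteness, a minimal sketch of that ordering (my own illustration, with made-up numbers): the pirate ranks outcomes lexicographically, first by its own utility and then by how little it leaves me.

```python
def pirate_choice(outcomes):
    """outcomes: list of (pirate_utility, my_utility) pairs."""
    # Maximize own utility; among ties, pick the outcome that gives me the least.
    return max(outcomes, key=lambda o: (o[0], -o[1]))

# Under a zero-incentive scheme the pirate's own payoff is the same whatever
# it grants me, so all of these are "equally rational" choices for it...
print(pirate_choice([(5, 9), (5, 4), (5, 0)]))  # (5, 0): it grants me nothing.
```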
Eliezer’s prescribed negative incentive for each degree of departure from ‘fair’ ensures that pirates behave themselves, even if the punishment factor is tiny.
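A tiny worked example of that claim, again with made-up numbers: shave even an epsilon off the pirate’s payoff per unit it falls short of the fair grant, and its tie-break never gets the chance to act.

```python
epsilon, fair_grant = 0.001, 9
# (pirate's payoff, what it grants me) for grants of 9, 4 and 0, where each
# unit short of the fair grant of 9 costs the pirate epsilon.
options = [(5 - epsilon * (fair_grant - g), g) for g in (9, 4, 0)]
print(max(options, key=lambda o: (o[0], -o[1])))  # (5.0, 9): fairness wins.
```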
Eliezer’s punishment policy also applies (and is necessary) when dealing with what we could call “petty sadists”: that is, agents whose utility functions actually include a small negative term for the utility granted to the other player.
Usually, considering things like petty sadism and ‘pirates’ is beyond the scope of a decision theory problem and it would be inappropriate to mention them. But when a proposed solution offers literally zero incentive to grant the payoff, these considerations become relevant. Even the slightest amount of noise in an agent, in the communication channel, or in a utility function can flip the behaviour around. “Epsilon” stops being negligible when you compare it to ‘zero’.
Using Eliezer’s punishment solution instead of Stuart’s seems to be pure blackmail.
While I reject many cases of blackmail with unshakable stubbornness, I think one of the clearest exceptions is the case where complying costs me nothing at all and the blackmail costs nothing, or next to nothing, for the blackmailer.
In the limit of sufficiently intelligent agents with perfect exchange of decision-algorithm source code (utility-function source code not required), rational agents implementing Eliezer’s punishment-for-unfairness system will arrive at punishment factors approaching zero, and the final decision will approach Stuart’s Pareto-dominant solution.
When there is less mutual trust in the decision algorithms of the other agents, or less trust in the communication process, a greater amount of punishment for unfairness is desirable.
Punishing unfairness is the ‘training wheels’ of cooperation between agents with different ideas of fairness.
My intuition is more along the lines of:
Suppose there’s a population of agents you might meet, and the two of you can only bargain by simultaneously stating two acceptable-bargain regions, after which the Pareto-optimal point in the intersection of both regions is picked. I would intuitively expect this to be the result of two adapted Masquerade algorithms facing each other.
Most agents think the fair point is N and will refuse to go below it unless you do worse, but some might accept an exploitive point of N’. The slope down from N has to be steep enough that having a few N’-accepting agents will not provide a sufficient incentive to skew your perfectly-fair point away from N, so that the global solution is stable. If there’s no cost to destroying value for all the N-agents, adding a single exploitable N’-agent will give each bargaining agent an individual incentive to adopt this new N’-definition of fairness. But when two N’-agents meet (one reflected), their intersection destroys huge amounts of value. So without that cost, the global equilibrium is not very Nash-stable.
I would then expect this group-level argument to individualize to single agents facing probability distributions over other agents.
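A toy model of the bargaining step being described, under my own simplifying assumption that the feasible outcomes form a finite set:

```python
def pareto_frontier(points):
    """Points not weakly dominated by any other point."""
    return [p for p in points
            if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in points)]

region_a = {(5, 5), (6, 4), (7, 3), (3, 3)}   # outcomes A declares acceptable
region_b = {(5, 5), (4, 6), (3, 7), (3, 3)}   # outcomes B declares acceptable
print(pareto_frontier(region_a & region_b))   # [(5, 5)]: the deal that is struck
```

With continuous regions the same idea applies; the population-level question above is about which regions it pays to declare, given the distribution of fairness notions you expect to meet.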
I’m not getting what you’re going for here. If these agents actually change their definition of fairness based on other agents’ definitions, then they are trivially exploitable. Are there two separate behaviors here: you want unexploitability in a single encounter, but you still want these agents to be able to adapt their definition of “fairness” based on the population as a whole?
I’m not sure that exploitability is trivial. What is trivial is that some kinds of willingness to change one’s definition of fairness make an agent exploitable. However, this doesn’t hold for all kinds of willingness to change the fairness definition. Some agents may change their definition of fairness in their own favour in order to exploit agents vulnerable to that tactic, while not being willing to change their definition of fairness when it harms them. The only ‘exploit’ available against such agents is ‘prevent them from exploiting me and force them to use their default definition of fair’.
Ah, that clears this up a bit. I think I just didn’t notice when N’ switched from representing an exploitive agent to an exploitable one. Either that, or I have a different association for ‘exploitive agent’ than what EY intended (namely, one which attempts to exploit).
This does not sound like what I had in mind. You pick a series of outcomes that are increasingly unfair to you and increasingly worse for the other player, whose first element is what you deem the fair Pareto outcome: (100, 100), (98, 99), (96, 98); you stop well short of Nash and then drop to Nash. The other does the same. Unless one of you has a completely skewed idea of fairness, you should be able to meet somewhere in the middle. Both of you will do worse against a fixed opponent’s strategy by unilaterally adopting more self-favoring ideas of fairness. Both of you will do worse in expectation against potentially exploitive opponents by unilaterally adopting looser ideas of fairness. This gives everyone an incentive to obey the Galactic Schelling Point and be fair about it.
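One simple way to operationalize “meet somewhere in the middle”, using made-up numbers and treating each player’s list as the exact set of outcomes it will sign off on (a toy model, not necessarily exactly what is intended here):

```python
def step_down(fair, a_step, b_step, steps):
    """Concession list: start at the fair point, destroy a little value each step."""
    a, b = fair
    return [(a - i * a_step, b - i * b_step) for i in range(steps)]

nash = (0, 0)
list_a = step_down((100, 100), a_step=2, b_step=1, steps=10)  # A gives up 2 per step
list_b = step_down((96, 104), a_step=1, b_step=2, steps=10)   # B gives up 2 per step
common = set(list_a) & set(list_b)
print(max(common, key=sum) if common else nash)  # (92, 96): they meet in the middle
```

Each player ends up a little below its own fair point, and the value destroyed on the way down is the punishment that the scheme described above relies on.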
My solution Pareto-dominates that approach, I believe. It’s precisely the best you can do, given that each player cannot win more than what the other thinks their “fair share” is.
I tried to generalize Eliezer’s outcomes to functions, and realized that if both agents are unexploitable, the optimal functions to pick would lead to Stuart’s solution precisely. Stuart’s solution allows agents to arbitrarily penalize the other, though, which is why I like extending Eliezer’s concept better. Details below. (P.S. I tried to post this in a comment above, but in editing it I appear to have somehow made it invisible, at least to me. Sorry for the repost if you can indeed see all the comments I’ve made.)
It seems the logical extension of your finitely many step-downs in “fairness” would be to define a function f(your_utility) which returns the greatest utility you will accept the other agent receiving when you receive that utility. The domain of this function should run from wherever your magical fairness point is down to the Nash equilibrium. As long as it is monotonically increasing, that should ensure unexploitability for the same reasons your finite version does. The offer both agents should make is at the greatest intersection point of these functions, with one of them inverted to put them on the same axes. (This intersection is guaranteed to exist in the only interesting case, where the agents do not accept each other’s magical fairness points as fair enough.)
Curiously, if both agents use this strategy, then both agents seem to be incentivized to give their function as much “skew” (as EY defined it in clarification 2) as possible, since both functions are monotonically increasing, so decreasing your opponent’s share can only decrease your own. Asymptotically, choosing these functions optimally, this means that both agents will end up getting what the other agent thinks is fair, minus a vanishingly small factor!
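A sketch of that construction with made-up concession curves (the curve shapes are my own arbitrary choice, purely for illustration): f_A(a) is the most utility A will let B have when A itself receives a, running from A’s fair point down to Nash, and symmetrically for B.

```python
# Toy curves: A's fair point is (110, 90), B's is (90, 110), Nash is (0, 0).
def f_A(a):                     # B's allowed utility as a function of A's
    return 90 * (a / 110) ** 0.5

def f_B(b):                     # A's allowed utility as a function of B's
    return 90 * (b / 110) ** 0.5

# Walk along A's curve and keep the points B's curve also allows; the offer
# is the greatest such point (the greatest intersection, approximately).
a_grid = [i * 110 / 100000 for i in range(100001)]
feasible = [(a, f_A(a)) for a in a_grid if f_B(f_A(a)) >= a]
print(max(feasible, key=sum))   # roughly (73.6, 73.6) with these curves
```

With these arbitrary curves the deal is worse for both than either fair point; choosing the curves optimally, as described above, pushes the intersection towards (90, 90), where each agent gets exactly what the other considers fair.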
Let me know if my reasoning above is clear. If not, I can clarify, but I’ll avoid expending the extra effort of revising further if what I already have is clear enough. Also, simple confirmation that I didn’t make a silly logical mistake or post something already well known in the community is always appreciated.
I concur; my reasoning likely overlaps in parts. I particularly like your observation about the asymptotic behaviour when choosing the functions optimally.
If I’m determining the outcome of the other player, doesn’t that mean that I can change my “fair point” to threaten the other player with no downside for me? That might also lead to blackmail...
Indeed! And this is especially the case if any sort of negotiations are allowed.
But every system is vulnerable to that, even the “random dictator”, which is the ideal of unexploitability. You can always say “I promise to be a better (worse) dictator if you (unless you) also promise to be better”.
If I understand correctly, what Stuart proposes is just a special case of what Eliezer proposes. EY’s scheme requires some function mapping the degree of skew in the split to the number of points you’re going to take off the total. SA’s scheme is the special case where that function is the constant zero.
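A sketch of that parameterization (my own toy framing; the exact penalty function is left open here, and for simplicity I take the penalty off what I grant the other player rather than off the joint total): the scheme is determined by a function from the skew of the other’s ask, relative to my fair point, to the value destroyed, and Stuart’s scheme is the special case where that function is identically zero.

```python
MY_FAIR_GRANT = 100          # what I consider a fair amount for the other player

def grant(other_ask, penalty):
    """What I actually give them: my fair grant, minus a penalty for over-asking."""
    skew = max(0, other_ask - MY_FAIR_GRANT)
    return min(other_ask, MY_FAIR_GRANT) - penalty(skew)

def stuart(skew):            # constant zero: no value destroyed, ever
    return 0

def eliezer(skew):           # destroy half a point per point of skew (toy choice)
    return 0.5 * skew

print(grant(120, stuart))    # 100: they over-ask and still get my full fair grant
print(grant(120, eliezer))   # 90.0: over-asking now costs them
```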
The more punishing a function you use, the stronger the incentive you create for others to accept your definition of ‘fair’. On the other hand, if the party you’re trading with genuinely has a different concept of ‘fair’ and you’re both following this technique, it’d be best for both of you to use the more lenient zero-penalty approach.
As far as I can tell, if you’ve reliably pre-committed to not give in to blackmail (and the other party is supposed to be able to read your source code after all), the zero-penalty approach seems to be optimal.