Do I understand it correctly that the approach is to adjust the decision probabilities in a way that penalizes low-utility outcomes in some way and rewards high-utility outcomes?
Yes, the idea is that you switch actions more often when utility is low.
If so, do you recommend a specific way of doing so?
Yes, the decision rule is in the post.
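Roughly: the lower your realized utility, the more likely you are to abandon your current action. Here is a minimal sketch of a rule with that flavor; the particular functional form below (linear switch probability between bounds `u_min` and `u_max`) is only an illustration, not the exact rule from the post.

```python
import random

def maybe_switch(current_action, actions, utility, u_min, u_max, rng=random):
    """Illustrative switching rule (not the post's exact rule): switch to a
    uniformly random other action with probability that grows linearly as
    utility falls from u_max toward u_min."""
    # Normalize utility to [0, 1]; low utility -> high switch probability.
    p_switch = 1.0 - (utility - u_min) / (u_max - u_min)
    if rng.random() < p_switch:
        return rng.choice([a for a in actions if a != current_action])
    return current_action
```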
How hard would it be to implement what you propose constructively, say in the iterated Prisoner's Dilemma with a bunch of bots following this strategy, and see whether they converge to a Nash equilibrium, hopefully the best one?
This doesn’t guarantee a Pareto-efficient outcome in games where players have different utility functions (see “Extension to non-common-payoff games”). Running these strategies in iterated games is difficult given that there are infinitely many “actions” (policies). It’s plausible that this approach yields something like ϵ-approximate correlated equilibria, but I haven’t worked out the details, and anyway I’d rather try to find something that gets Pareto-efficient outcomes.
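That said, a toy version of the experiment over stage-game actions (rather than policies, which sidesteps the infinite-action issue) is easy to set up. Everything below is an assumption for illustration, not the construction from the post: the standard PD payoff matrix, a linear switch probability, and simple frequency bookkeeping.

```python
import random

# Standard Prisoner's Dilemma payoffs for "me", indexed by (my_action, their_action).
PAYOFF = {('C', 'C'): 3, ('C', 'D'): 0, ('D', 'C'): 5, ('D', 'D'): 1}
U_MIN, U_MAX = 0, 5  # payoff bounds used to normalize into a switch probability

def step(actions, rng):
    """One round: each player receives its payoff, then flips its action with
    probability proportional to how far that payoff falls short of U_MAX."""
    payoffs = (PAYOFF[(actions[0], actions[1])], PAYOFF[(actions[1], actions[0])])
    next_actions = []
    for a, u in zip(actions, payoffs):
        p_switch = (U_MAX - u) / (U_MAX - U_MIN)
        next_actions.append(('D' if a == 'C' else 'C') if rng.random() < p_switch else a)
    return next_actions

def run(rounds=100_000, seed=0):
    rng = random.Random(seed)
    actions = [rng.choice('CD'), rng.choice('CD')]
    counts = {}
    for _ in range(rounds):
        key = tuple(actions)
        counts[key] = counts.get(key, 0) + 1
        actions = step(actions, rng)
    return counts  # long-run frequency of each joint action profile

print(run())
```

Whether the long-run action frequencies under a rule like this look anything like an equilibrium is exactly what would need checking; I wouldn’t read much into the toy version.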