I don’t understand your ideas in detail (am interested but don’t have the time/ability/inclination to dig into the mathematical details), but from the informal writeups/reviews/critiques I’ve seen of your overall approach, as well as my sense from reading this comment of how far away you are from a full solution to the problems I listed in the OP, I’m still comfortable sticking with “most are wide open”. :)
On the object level, maybe we can just focus on Problem 4 for now. What do you think actually happens in a 2IBH-1CDT game? Presumably CDT still plays D, and what do the IBH agents do? And how does that imply that the puzzle is resolved?
As a reminder, the puzzle I see is that this problem shows that a CDT agent doesn’t necessarily want to become more UDT-like, and for seemingly good reason, so on what basis can we say that UDT is a clear advancement in decision theory? If CDT agents similarly don’t want to become more IBH-like, isn’t there the same puzzle? (Or do they?) This seems different from the playing chicken with a rock example, because a rock is not a decision theory so that example doesn’t seem to offer the same puzzle.
ETA: Oh, I think you’re saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it’s not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?
...I’m still comfortable sticking with “most are wide open”.
Allow me to rephrase. The problems are open, that’s fair enough. But, the gist of your post seems to be: “Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely.” On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA, which has implications for those problems among other things, and so far this progress seems to only reinforce the idea that UDT is “morally” correct. That is, not that any of the old attempted formalizations of UDT is correct, but that the intuition behind UDT, and its recommendations in many specific scenarios, are largely justified.
ETA: Oh, I think you’re saying that the CDT agent could turn into an IBH agent but with a different prior from the other IBH agents, that ends up allowing it to still play D while the other two still play C, so it’s not made worse off by switching to IBH. Can you walk this through in more detail? How does the CDT agent choose what prior to use when switching to IBH, and how do the different priors actually imply a CCD outcome in the end?
While writing this part, I realized that some of my thinking about IBH was confused, and some of my previous claims were wrong. This is what happens when I’m overeager to share something half-baked. I apologize. In the following, I try to answer the question while also setting the record straight.
An IBH agent considers different infra-Bayesian hypotheses starting from the most optimistic ones (i.e. those that allow guaranteeing the most expected utility) and working its way down, until it finds something that works[1]. Such algorithms are known as “upper confidence bound” (UCB) in learning theory. When multiple IBH agents interact, they start with each trying to achieve its best possible payoff in the game[2], and gradually relax their demands, until some coalition reaches a payoff vector which is admissible for it to guarantee. This coalition then “locks” its strategy, while other agents continue lowering their demands until there is a new coalition among them, and so on.
Notice that the pace at which agents lower their demands might depend on their priors (by affecting how many hypotheses they have to cull at each level), their time discounts and maaaybe also other parameters of the learning algorithm. Some properties this process has:
Every agent always achieves at least its maximin payoff in the end. In particular, a zero-sum two-player game ends in a Nash equilibrium.
If there is a unique strongly Pareto-efficient payoff (such as in Hunting-the-Stag), the agents will converge there.
In a two-player game, if the agents are similar enough that it takes them about the same time to go from optimal payoff to maximin payoff, the outcome is strongly Pareto-efficient. For example, in a Prisoner’s Dilemma they will converge to player A cooperating and player B cooperating some of the time and possibly defecting some of the time, such that player A’s payoff is still strictly better than DD. However, without any similarity assumption, they might instead converge to an outcome where one player is doing its maximin strategy and the other its best response to that. In a Prisoner’s Dilemma, that would be DD[3]. (A toy sketch of this dynamic appears after this list.)
In a symmetric two-player game, with very similar agents (which might still have independent random generators), they will converge to the symmetric Pareto-efficient outcome. For example, in a Prisoner’s Dilemma they will play CC, whereas in Chicken [the version where flipping a coin is better than both swerving] they will “flip a coin” (e.g. alternate) to decide who goes straight and who swerves.
The previous bullet is not true with more than two players. There can be stochastic selection among several possible points of convergence, because there are games in which different mutually exclusive coalitions can form. Moreover, the outcome can fail to be Pareto efficient, even if the game is symmetric and the agents are identical (with independent random generators).
Specifically, in Wei Dai’s 3-player Prisoner’s Dilemma, IBH among identical agents always produces CCC. IBH among arbitrarily different agents might produce CCD (if one player is very slow to lower its demands, while the other two lower their demands at the same, faster pace), or even DDD (if each of the players lowers its demands on its own, very different timescale).
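Here is a minimal toy sketch of the demand-lowering dynamic above (made-up payoff numbers; the actual algorithm searches over infra-Bayesian hypotheses rather than a single scalar demand), just to make the similarity condition concrete:

```python
# Toy sketch of the demand-lowering dynamic (made-up numbers; not the real IBH
# algorithm, which searches over infra-Bayesian hypotheses rather than tracking
# a single scalar demand). Assumed Prisoner's Dilemma payoffs:
# temptation DC = 5, cooperation CC = 3, maximin DD = 1, sucker CD = 0.

BEST, COOP, MAXIMIN = 5.0, 3.0, 1.0

def toy_outcome(rate_a, rate_b):
    """Outcome for two agents that lower their demanded payoff at fixed paces.

    Each agent starts by demanding BEST and lowers its demand by rate_i per step.
    They settle on CC if both demands reach the cooperative level before either
    agent has already bottomed out at maximin and locked in D; otherwise the
    faster agent locks D and the slower one eventually best-responds with D.
    """
    time_to_coop = max((BEST - COOP) / rate_a, (BEST - COOP) / rate_b)
    time_to_lock = min((BEST - MAXIMIN) / rate_a, (BEST - MAXIMIN) / rate_b)
    return "CC" if time_to_coop <= time_to_lock else "DD"

print(toy_outcome(1.0, 1.2))   # similar paces -> CC
print(toy_outcome(1.0, 10.0))  # very different paces -> DD
```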
We can operationalize “CDT agent” as e.g. a learning algorithm satisfying an internal regret bound (see sections 4.4 and 7.4 in Cesa-Bianchi and Lugosi) and the process of self-modification as learning on two different timescales: a slow outer loop that chooses a learning algorithm for a quick inner loop (this is simplistic, but IMO still instructive). Such an agent would indeed choose IBH over CDT if playing a Prisoner’s Dilemma (and would prefer an IBH variant that lowers its demands slowly enough to get more of the gains-of-trade but quickly enough to actually converge), whereas in the 3-player Prisoner’s Dilemma there is at least some IBH variant that would be no worse than CDT.
If all players have metalearning in the outer loop, then we get dynamics similar to Chicken [the version in which both swerving is better than flipping a coin[4]], where hard-bargaining (slower to lower demands) IBH corresponds to “straight” and soft-bargaining (quick to lower demands) IBH corresponds to “swerve”. Chicken [this version] between two identical IBH agents results in both swerving. Chicken between hard-IBH and soft-IBH results in hard-IBH getting a higher probability of going straight[5]. Chicken between two CDTs results in a correlated equilibrium, which might have some probability of crashing. Chicken between IBH and CDT… I’m actually not sure what happens off the top of my head, the analysis is not that trivial.
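To keep the two versions of Chicken straight, here is a small sketch with made-up payoff numbers (illustrative assumptions, not anything canonical):

```python
# Illustrative (assumed) payoff tables for the two versions of Chicken discussed above.
# Entries are (row player's payoff, column player's payoff).
# "Coin flip" means a fair 50/50 between (straight, swerve) and (swerve, straight).

def analyze(name, payoffs):
    both_swerve = payoffs[("swerve", "swerve")][0]
    coin_flip = 0.5 * (payoffs[("straight", "swerve")][0] + payoffs[("swerve", "straight")][0])
    better = "both swerving" if both_swerve > coin_flip else "flipping a coin"
    print(f"{name}: both-swerve payoff {both_swerve}, coin-flip payoff {coin_flip} -> {better} is better")

# Version used for the hard-IBH vs soft-IBH metagame: both swerving beats the coin flip.
chicken_meta = {("swerve", "swerve"): (3, 3), ("straight", "swerve"): (4, 1),
                ("swerve", "straight"): (1, 4), ("straight", "straight"): (0, 0)}

# Version from the earlier bullet on symmetric games: the coin flip beats both swerving.
chicken_coin = {("swerve", "swerve"): (2, 2), ("straight", "swerve"): (4, 1),
                ("swerve", "straight"): (1, 4), ("straight", "straight"): (0, 0)}

analyze("metagame version", chicken_meta)   # 3 > 2.5
analyze("coin-flip version", chicken_coin)  # 2 < 2.5
```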
[1] This is pretty similar to “modal UDT” (going from optimistic to pessimistic outcomes until you find a proof that some action can guarantee that outcome). I think that the analogy can be made stronger if the modal agent uses an increasingly strong proof system during the search, which IIRC was also considered before. The strength of the proof system then plays the role of “logical time”, and the pacing of increasing the strength is analogous to the (inverse function of the) temporal pacing at which an IBH agent lowers its target payoff.
[2] Assuming that they start out already knowing the rules of the game. Otherwise, they might start by trying to achieve payoffs which are impossible even with the cooperation of other players. So, this is a good model if learning the rules is much faster than learning anything to do with the behavior of other players, which seems like a reasonable assumption in many cases.
[3] It is not that surprising that two sufficiently dissimilar agents can defect. After all, the original argument for superrational cooperation was: “if the other agent is similar to you, then it cooperates iff you cooperate”.
[4] I wish we had good names for the two versions of Chicken.
[5] This seems nicely reflectively consistent: soft/hard-IBH in the outer loop produces soft/hard-IBH respectively in the inner loop. However, two hard-IBH agents in the outer loop produce two soft-IBH agents in the inner loop. On the other hand, comparing absolute hardness between outer and inner loop seems not very meaningful, whereas comparing relative-between-players hardness between outer and inner loop is meaningful.
But, the gist of your post seems to be: “Since coming up with UDT, we ran into these problems, made no progress, and are apparently at a dead end. Therefore, UDT might have been the wrong turn entirely.”
This is a bit stronger than how I would phrase it, but basically yes.
On the other hand, my view is: Since coming up with those problems, we made a lot of progress on agent theory within the LTA
I tend to be pretty skeptical of new ideas. (This backfired spectacularly once, when I didn’t pay much attention to Satoshi when he contacted me about Bitcoin, but I think it has served me well in general.) My experience with philosophical questions is that even when some approach looks a stone’s throw away from a final solution to some problem, a bunch of new problems pop up and show that we’re still quite far away. With an approach that is still as early as yours, I just think there’s quite a good chance it doesn’t work out in the end, or gets stuck somewhere on a hard problem. (Also, some people who have dug into the details don’t seem as optimistic that it is the right approach.) So I’m reluctant to decrease my probability of “UDT was a wrong turn” too much based on it.
The rest of your discussion about 2TDT-1CDT seems plausible to me, although of course it depends on whether the math works out, on doing something about monotonicity, and also on a solution to the problem of how to choose one’s IBH prior. (If the solution was something like “it’s subjective/arbitrary” that would be pretty unsatisfying from my perspective.)
...the problem of how to choose one’s IBH prior. (If the solution was something like “it’s subjective/arbitrary” that would be pretty unsatisfying from my perspective.)
It seems clear to me that the prior is subjective. Like with Solomonoff induction, I expect there to exist something like the right asymptotic for the prior (i.e. an equivalence class of priors under the equivalence relation where μ and ν are equivalent when there exists some C > 0 s.t. μ ≤ Cν and ν ≤ Cμ), but not a unique correct prior, just like there is no unique correct UTM. In fact, my arguments about IBH already rely on the asymptotic of the prior to some extent.
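Spelled out (reading μ ≤ Cν as event-wise domination):

```latex
% Equivalence of priors up to a multiplicative constant
% (reading \mu \le C\nu as domination on every event).
\[
  \mu \sim \nu
  \;\iff\;
  \exists C > 0 \;\; \forall A :\quad
  \mu(A) \le C\,\nu(A)
  \quad\text{and}\quad
  \nu(A) \le C\,\mu(A).
\]
```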
One way to view the non-uniqueness of the prior is through an evolutionary perspective: agents with prior X are likely to evolve/flourish in universes sampled from prior X, while agents with prior Y are likely to evolve/flourish in universes sampled from prior Y. No prior is superior across all universes: there’s no free lunch.
For the purpose of AI alignment, the solution is some combination of (i) learn the user’s prior and (ii) choose some intuitively appealing measure of description complexity, e.g. length of lambda term ((i) is insufficient in itself because you need some ur-prior to learn the user’s prior). The claim is that different reasonable choices in (ii) will lead to similar results.
Given all that, I’m not sure what’s still unsatisfying. Is there any reason to believe something is missing in this picture?