Updatelessness doesn’t solve most problems
In some discussions (especially about acausal trade and multi-polar conflict), I’ve heard the motto “X will/won’t be a problem because superintelligences will just be Updateless”. Here I’ll explain (in layman’s terms) why, as far as we know, it’s not looking likely that a super satisfactory implementation of Updatelessness exists, nor that superintelligences automatically implement it, nor that this would drastically improve multi-agentic bargaining.
Epistemic status: These insights seem like the most robust update from my work with Demski on Logical Updatelessness and discussions with CLR employees about Open-Minded Updatelessness. To my understanding, most researchers involved agree with them and the message of this post.
What is Updatelessness?
This is skippable if you’re already familiar with the concept.
It’s easier to illustrate with the following example: Counterfactual Mugging.
I will throw a fair coin.
If it lands Heads, you will be able to freely choose whether to pay me $100 (and if so, you will receive nothing in return).
If it lands Tails, I will check whether you paid me the $100 in the Heads world[1], and if so, I will pay you $1000.
When you find yourself in the Heads world, one might argue, the rational thing to do is to not pay. After all, you already know the coin landed Heads, so you will gain nothing by paying the $100 (assume this game is not iterated, etc.).
But if, before knowing how the coin lands, someone offers you the opportunity of committing to paying up in the Heads world, you will want to accept it! Indeed, you’re still uncertain about whether you’ll end up in the Heads or the Tails world (50% chance on each). If you don’t commit, you know you won’t pay if you find yourself in the Heads world (and so also won’t receive $1000 in the Tails world), so your expected payoff is $0. But if you commit, your payoff will be -$100 in the Heads world, and $1000 in the Tails world, so $450 in expectation.
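To make that arithmetic explicit, here is a minimal sketch (just the 50/50 prior and the payoffs from the example above, nothing more):

```python
# Prior expected payoff of the two options in Counterfactual Mugging,
# evaluated before seeing the coin.
p_heads, p_tails = 0.5, 0.5

# Committing to pay: lose $100 in the Heads world, receive $1000 in the Tails world.
ev_commit = p_heads * (-100) + p_tails * 1000   # 450.0

# Not committing: you won't pay on Heads, so you receive nothing on Tails either.
ev_no_commit = p_heads * 0 + p_tails * 0        # 0.0

print(ev_commit, ev_no_commit)  # 450.0 0.0
```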
This is indeed what happens to the best-known decision theories (CDT and EDT): they want to commit to paying, but if they don’t, by the time they get to the Heads world they don’t pay. We call this dynamic instability, because different (temporal) versions of the agent seem to be working against each other.
Why does this happen? Because, before seeing the coin, the agent is still uncertain about which world it will end up in, and so still “cares” about what happens in both (and this is reflected in the expected value calculation, when we include both with equal weight). But upon seeing the coin land, the agent updates on the information that it’s in the Heads world (and so that the Tails world doesn’t exist), and stops “caring” about the latter.
This is not so different from our utility function changing (before we were trying to maximize it in two worlds, now only in one), and we know that leads to instability.
An updateless agent would use a decision procedure that doesn’t update on how the coin lands. And thus, even if it found itself in the Heads world, it would acknowledge its previous credences gave equal weight to both worlds, and so pay up (without needing to have pre-committed to do so), because this was better from the perspective of the prior.
Indeed, Updatelessness is nothing more than “committing to maximize the expected value from the perspective of your prior” (instead of constantly updating your prior, so that the calculation of this expected value changes). This is not always straightforward or well-defined (for example, what if you learn of a radically new insight that you had never considered at the time of setting your prior?), so we need to fill in more details to obtain a completely defined decision theory. But that’s the gist of it.
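One way to make “maximize expected value from the perspective of your prior” concrete is to pick a whole policy (a mapping from observations to actions) by its prior expected value, rather than picking an action after updating on the observation. Here is a minimal sketch of that contrast for the coin example; the function names and the way the predictor is modeled are just illustrative assumptions, not a canonical formalization:

```python
ACTIONS = ["pay", "dont_pay"]
PRIOR = {"Heads": 0.5, "Tails": 0.5}

def payoff(world, action_in_heads):
    """Payoff given the coin's result and what your policy does in the Heads world.
    The predictor pays you $1000 in Tails iff your policy pays in Heads."""
    if world == "Heads":
        return -100 if action_in_heads == "pay" else 0
    return 1000 if action_in_heads == "pay" else 0

def updateless_choice():
    # Choose the policy (here, just the Heads action) with the best prior EV.
    def prior_ev(a):
        return sum(PRIOR[w] * payoff(w, a) for w in PRIOR)
    return max(ACTIONS, key=prior_ev)            # -> "pay" (EV 450 vs 0)

def updateful_choice_on_heads():
    # Having updated on "the coin landed Heads", maximize payoff in that world only.
    return max(ACTIONS, key=lambda a: payoff("Heads", a))  # -> "dont_pay"

print(updateless_choice(), updateful_choice_on_heads())  # pay dont_pay
```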
Updatelessness allows you to cooperate with your counterfactual selves (for example, your Heads self can cooperate with your Tails self), because you both care about each other’s worlds. Updatelessness allows for this kind of strategicness: instead of each counterfactual self doing its own thing (possibly at odds with other selves), they all work together to maximize expected utility according to the prior.
Also, between the two extremes of complete updatelessness (commit right now to a course of action forever that you’ll never revise, which is the only dynamically stable option) and complete updatefulness (basically EDT or CDT as usually presented), there’s an infinite array of decision theories which are “partially updateless”: they update on some kinds of information (so they’re not completely stable), but not others.[2]
The philosophical discussion about which exact decision theory seems better (and how updateless it is) is completely open (not least because there is no objective metric to compare decision theories’ performance). But now you have a basic idea of why Updatelessness has some appeal.
A fundamental trade-off
This dichotomy between updating or not doesn’t only happen for empirical uncertainty (how the empirical coin will land): it also happens for logical/mathematical/computational uncertainty. Say, for instance, you are uncertain about whether the trillionth digit of pi is Even or Odd. We can play the same Counterfactual Mugging as above, just with this “mathematical coin” instead of an empirical coin[3].
So now, if you think you might play this game in the future, you have a reason not to learn about the trillionth digit of pi: you want to preserve your strategicness, your coordination with your counterfactuals, since that seems better in expected value from your current prior (which is uncertain about the digit, and so cares about both worlds). Indeed, if you learn (update on) the parity of this digit, and you let your future decisions depend on this information, then your two counterfactual selves (the one in the Even world and the one in the Odd world) might act differently (and maybe at odds), each only caring about their own world.
But you also have a lot of reasons to learn about the digit! Maybe doing so helps you understand math better, and you can use this to better navigate the world and achieve your goals in many different situations (some of which you cannot predict in advance). Indeed, the Value of Information theorems of academic decision theory basically state that updating on information is useful for furthering your goals in many circumstances.[4]
So we seem to face a fundamental trade-off between the information benefits of learning (updating) and the strategic benefits of updatelessness. If I learn the digit, I will better navigate some situations which require this information, but I will lose the strategic power of coordinating with my counterfactual self, which is necessary in other situations.
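As a toy illustration of both sides of the trade-off (the quiz game and its $50 prize are made up purely for illustration; only the structure matters): in an ordinary prediction task, learning the digit can only help, while in the mugging-like game, letting your policy depend on the digit destroys value from the prior’s perspective.

```python
PRIOR = {"Even": 0.5, "Odd": 0.5}

# Side 1: value of information. A (hypothetical) quiz pays $50 for correctly
# guessing the parity of the digit.
def quiz_payoff(truth, guess):
    return 50 if guess == truth else 0

# Without learning the digit, the best fixed guess gets $25 in expectation;
# having learned it, you always answer correctly and get $50.
ev_ignorant = max(sum(PRIOR[w] * quiz_payoff(w, g) for w in PRIOR) for g in PRIOR)
ev_informed = sum(PRIOR[w] * quiz_payoff(w, w) for w in PRIOR)
print(ev_informed - ev_ignorant)  # 25.0: learning the digit is strictly useful here

# Side 2: strategicness. In the logical Counterfactual Mugging on the same digit,
# a policy that conditions on the digit never pays (and so never gets paid),
# while the policy that ignores the digit keeps the 450 prior EV from before.
ev_conditioning_policy = 0.0
ev_ignoring_policy = 0.5 * (-100) + 0.5 * 1000
print(ev_ignoring_policy - ev_conditioning_policy)  # 450.0: here learning hurts
```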
Needless to say, this trade-off happens for each possible piece of information. Not only the parity of the trillionth digit of pi, but also that of the hundredth, the tenth, whether Fermat’s Last Theorem is true, etc. Of course, many humans have already updated on some of these pieces of information, although there is always an infinite amount of them we haven’t updated on yet.
Multi-agentic interactions
These worries might seem esoteric and far-fetched, since indeed Counterfactual Mugging seems like a very weird game that we’ll almost certainly never experience. But unfortunately, situations equivalent to this game are the norm in strategic multi-agentic interactions. And there as well we face this fundamental trade-off between learning and strategicness, giving rise to the commitment races problem. Let me summarize that problem, using the game of Chicken:
Say two players have conflicting goals. Each of them can decide whether to be very aggressive, possibly threatening conflict and trying to scare the other player, or to instead not try so hard to achieve their goal, and ensure at least that conflict doesn’t happen (since conflict would be very bad for both).
A strategy one could follow is to first see whether the other player is playing aggressive, and be aggressive iff the other is not being aggressive.
The problem is, if the other learns you will be playing this strategy, then they will play aggressive (even if at first they were wary of doing so), knowing you will simply let them have the win. If instead you had committed to play aggressive no matter what the other does (instead of following your strategy), then maybe the other would have been scared off, and you would have won (also without conflict).
What’s really happening here is that, in making your strategy depend on the other’s move (by updating on the other’s move), you are giving them power over your action, which they can use to their advantage. So here again we face the same trade-off: by updating, you at least ensure conflict doesn’t happen (because your action will be a best-possible-response to the other’s move), but you also lose your strategicness (because your action will be manipulable by the other).
The commitment races problem is very insidious, because empirically seeing what the other has played is not the only way of updating on their move or strategy: thinking about what they might play to improve your guess about it (which is a kind of super-coarse-grained simulation), or even thinking about some basic game-theoretic incentives, can already give you information about the other player’s strategy. Which you might regret having learned, due to losing strategicness, and due to the other player possibly learning or predicting this and manipulating your decision.
So one of the players might reason: “Okay, I have some very vague opinions about what the other might play, hmm, should I be aggressive or not?… Oh wait, oh fuck, if I think more about this I might stumble upon information I didn’t want to learn, thus losing strategicness. I should better commit already now (early on) to always being aggressive, that way the other will probably notice this, and get scared and best-respond to avoid entering conflict. Nice. [presses commitment button]”
This amounts to the player maximizing expected value from the perspective of their prior (their current vague opinions about the game), as opposed to learning more, updating the prior, and deciding on an action then. That is, they are being updateless instead of updateful, so as not to lose strategicness.
The problem here is, if all agents are okay with such extreme and early gambits (and have a high enough prior that their opponents will be dovish)[5], then they will all commit to be aggressive, and end up in conflict, which is the worst outcome possible.
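Here is a minimal payoff-matrix sketch of that dynamic. The numbers are illustrative (chosen only so that conflict is the worst outcome for both): committing to aggression is a great bet against a best-responder, but when both players make that bet, both get the conflict payoff.

```python
# Chicken-style payoffs as (row player, column player).
# "A" = aggressive, "Y" = yield. Mutual aggression (conflict) is worst for both.
PAYOFFS = {
    ("A", "A"): (-10, -10),  # conflict
    ("A", "Y"): (5, 0),      # row player gets their way
    ("Y", "A"): (0, 5),      # column player gets their way
    ("Y", "Y"): (1, 1),      # both back down
}

def column_best_response(row_action):
    # An updateful column player: look at the row player's move and best-respond.
    return max(["A", "Y"], key=lambda col: PAYOFFS[(row_action, col)][1])

# Case 1: a player who committed early to "A" faces a best-responder.
row = "A"
col = column_best_response(row)
print(PAYOFFS[(row, col)])   # (5, 0): the committer wins and conflict is avoided

# Case 2: both players rush to commit to "A" on their hawkish priors.
print(PAYOFFS[("A", "A")])   # (-10, -10): conflict, the worst outcome for both
```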
This indeed can happen when completely Updateless agents face each other. The Updateless agent is scared to learn anything more than their prior, because this might lead to them losing strategicness, and thus being exploited by other players who moved first, who committed first to the aggressive action (that is, who were more Updateless). As Demski put it:

One way to look at what UDT (Updateless Decision Theory) is trying to do is to think of it as always trying to win a “most meta” competition. UDT doesn’t want to look at any information until it has determined the best way to use that information. [...] It wants to move first in every game. [...] It wants to announce its binding commitments before anyone else has a chance to, so that everyone has to react to the rules it sets. It wants to set the equilibrium as it chooses. Yet, at the same time, it wants to understand how everyone else will react. It would like to understand all other agents in detail, their behavior a function of itself.

So, what happens if you put two such agents in a room together? Both agents race to decide how to decide first. [...] Yet, such examination of the other needs to itself be done in an updateless way. It’s a race to make the most uninformed decision.[6]
Is the trade-off avoidable?
There have been some attempts at surpassing this fundamental trade-off, somehow reconciling learning with strategicness, epistemics with instrumentality. Somehow negotiating with my other counterfactual selves, while at the same time not losing track of my own indexical position in the multi-verse.
In fact, it seems at first pretty intuitive that some solution along these lines should exist: just learn all the information, and then decide which parts of it to use. After all, you are not forced to use the information, right?
Unfortunately, it’s not that easy, and the problem recurs at a higher level: your procedure to decide which information to use will depend on all the information, and so you will already lose strategicness. Or, if it doesn’t depend, then you are just being updateless, not using the information in any way.
In general, these attempts haven’t come to fruition.
FDT-like decision theories don’t even engage with the learning vs updatelessness question. You need to give them a prior over how certain computations affect others[7], that is, a single time slice, a prior to maximize. A non-omniscient FDT agent playing Chicken can fall into a commitment race as much as anyone, if the prior at that time recommends commitment without thinking further.
Demski and I went beyond this by natively implementing dynamic logical learning in our framework (using Logical Inductors). And we had some ideas for reconciling learning with strategicness. But ultimately, the framework only further solidified how fundamental this trade-off is, and the existence of a “free parameter”: what exactly to update on.
Diffractor is working on some promising (unpublished) algorithmic results which amount to “updateless policies not being that hard to compute, and asymptotically achieving good payoffs”… but they do assume a certain structure in the decision-theoretic environment, which basically amounts to “there’s not too much information that is counter-productive to learn”. That is, there are not that many “information traps”, analogous to the usual “exploration traps” in learning theory.
Daniel Herrmann advocates for dropping talk of counterfactuals entirely and being completely updateful.
From my perspective, Open-Minded Updatelessness doesn’t push back on this fundamental trade-off. Instead, given that the trade-off persists, it explores which kinds and shapes of partial commitments seem more robustly net-positive from the perspective of our current game-theoretic knowledge (that is, our current prior). But this is a point of contention, and there are ongoing debates about whether OMU could get us something more.
To be clear, I’m not saying “I and others weren’t able to solve this problem, so this problem is unsolvable” (although that’s a small update). On the contrary, the important bit is that we seem to have elucidated important reasons why “the problem” is in fact a fundamental feature of mixing learning theory with game theory.
A more static (not dynamic, like Logical Inductors) and realist picture of computational counterfactuals (or, equivalently, subjunctive dependence) would help, but we have even more evidence that no such thing should exist in principle, and that it depends on the observer’s ontology[8].
Superintelligences
In summary, a main worry with Updateless agents is that, although they might gain some strategic advantages (in the situations they judged correctly from their prior), they might also act very silly due to having wrong beliefs at the time they froze their prior (stopped learning), especially when dealing with the future and complex situations.
And of course, it’s not like any agent arrives at a certain point where it knows enough, and can freeze its prior forever: there’s always an infinite amount of information to learn, and it’s hard to judge a priori which of it might be useful, and which of it might be counter-productive to learn. A scared agent who just thought for 3 seconds about commitment races doesn’t yet have a good picture of what important considerations it might miss out on if it simply commits to being aggressive. We might think a current human is in a better position: indeed, we know more things and apparently haven’t lost any important strategicness. But even then, our current information might be nothing compared to the complexities of interactions between superintelligences. So, were we to fix our prior and go updateless today, we wouldn’t really know what we might be missing out on, and it might importantly backfire.
Still, there might be some ways in which we can be strategic and sensible about what information to update on. We might be able to notice patterns like “realizations in this part of concept-space are usually safe”. And it’s even conceivable that these proxies work very well, and superintelligences notice that (they don’t get stuck in commitment races before noticing it), and have no problem coordinating. But we are also importantly uncertain about whether that’s the case. And we are even more uncertain about how common priors that expect this are.
It’s not looking like there will exist a simple, perfect delimitation that determines whether we should update on any particular piece of information. Rather, a complex and dynamic mess of uncertain opinions about game theory and the behavior of other agents will determine whether committing not to update on X seems net-positive. In a sense, this happens because different agents might have different priors or path-dependent chains of thought.
A recently-booted AGI, still with only the same knowledge of game theory we have, would probably find itself in our same situation: a few of the possible commitments seem clearly net-positive, a few others net-negative… and for most of them, it is very uncertain, and has to vaguely assess (without thinking too much!) whether it seems better to think more about them or to enact them right away.
Even a fully-fledged superintelligence who knows a lot more than us (because it has chosen to update on much information) might find itself in an analogous situation: most of the value of the possible commitments depends on how other, similarly complex (and similarly advanced in game theory) superintelligences think and react. So to speak, as its knowledge of game theory increases, the baseline complexity of the interactions it has to worry about also increases.
This doesn’t preclude the existence of very promising interventions that could partially alleviate these problems. We humans don’t yet grasp the possible complexities of superintelligent bargaining. But if we saw, for example, a nascent AGI falling for a very naive commitment due to commitment races, we could ensure it instead learns some pieces of information that we are lucky enough to already be pretty sure are robustly safe. For example, Safe Pareto Improvements.
Other lenses
It might be that these characterizations are importantly wrong because superintelligences think about decision theory (or their equivalent of it) in a fundamentally different way[9]. Then we’d be even more uncertain about whether something akin to “a very concrete kind of updatelessness being the norm” happens. But there are still a few different lenses that can be informative.
One such lens is selection pressures. Does AI training, or more generally physical reality and multi-agentic conflict, select for more updateless agents?
Here’s a reason why it might not: Updatelessness (that is, maximizing expected value under your current prior and leaving no or little room for future revision) can be seen as a “very risky expected value bet” (at least in the game-theoretic scenarios that seem most likely). As an example, an agent committing to always be aggressive (because it thinks it’s likely enough that others will be dovish) will receive an enormous payoff in a few worlds (those in which the others are indeed dovish), but also a big punishment in the rest (those in which they aren’t, and there’s conflict). Being updateful is closer to minimaxing against your opponent: you might lose some strategicness, but at least you ensure you can best-respond (thus always avoiding conflict).
But such naive and hawkish expected value maximization might not be too prevalent in reality, in the long run. Similarly to how Kelly Bettors survive longer than Expected Wealth Maximizers in betting scenarios, updateful agents (who are not “betting everything on their prior being right”) might survive longer than updateless ones.
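As a rough illustration of that analogy, here is a standard toy simulation (the betting game and its parameters are the usual textbook ones, not anything specific to this post): repeated even-money bets with a 60% win probability, where the expected-wealth maximizer stakes everything each round and the Kelly bettor stakes the Kelly fraction.

```python
import random

def simulate(fraction, rounds=1000, p_win=0.6, wealth=1.0):
    """Repeatedly bet `fraction` of current wealth on an even-money bet."""
    for _ in range(rounds):
        stake = fraction * wealth
        wealth += stake if random.random() < p_win else -stake
        if wealth <= 0:
            return 0.0  # ruined
    return wealth

random.seed(0)
kelly = 2 * 0.6 - 1  # Kelly fraction for an even-money bet with p = 0.6, i.e. 0.2

# The all-in bettor maximizes single-round expected wealth, but is ruined by
# its first loss; the Kelly bettor compounds more slowly and survives.
print(simulate(fraction=1.0))    # almost surely 0.0
print(simulate(fraction=kelly))  # typically astronomically large after 1000 rounds
```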
A contrary consideration, though, is that updateful agents, due to losing strategicness and getting exploited, might lose most of their resources. It’s unclear how this all plays out in any of the following scenarios: (Single) AI training, Multi-agentic interactions on Earth, and Inter-galactic interactions.
In fact, it’s even conceivable that a superintelligence’s decision theory is mostly “up for grabs”, in the sense that which decision theory ends up being used when interacting with other superintelligences (conditional on our superintelligence getting there) is pretty path-dependent, and different trainings or histories could influence it.
Another lens is generalization (which is related to selection pressures in AI training). Maybe AI training incentivizes heuristics of a certain shape (similar to or different from ours), and their natural generalizations lead to this or that decision theory. We are very uncertain about what happens in both of these steps. Possibly a conceptual study of “planning-heuristic systematization/evolution”, with a similar flavor to value systematization, would shed some light.
It’s also conceivable that, when we make implementation more realistic (and further reduce the unexplained parts of decision theory), some different decision theories collapse to the same one (see here). But it’s not looking like this should collapse the updateless-updateful spectrum.[10]
Conclusion
It’s certainly not looking very likely (> 80%) that most superintelligences converge on the same neatly specified “level of updatelessness”, nor that this level is high, nor that they are all able to step behind approximately the same veil of ignorance to do acausal trade (so that they can trade with all kinds of intelligences, including humans), nor that in causal interactions they can easily and “fresh-out-of-the-box” coordinate on Pareto optimality (like performing logical or value handshakes) without falling into commitment races. And from my perspective, I would even say it’s unlikely (< 50%).
And in fact, if they did implement strong updateless commitments (the way we understand them today), that sounds like rather bad news, since it complicates game-theoretic interactions.
Acknowledgements
Thank you to Tristan Cook, Anthony DiGiovanni and James Faville for helpful comments on a draft, as well as to Abram Demski, Jesse Clifton, Nico Macé, Daniel Herrmann, Julian Stastny and Diffractor for related discussions.
Some worries arise about how I perform this check, but we could assume I have a copy of your algorithm and can run it with a different observation (Heads instead of Tails). More realistically, I will be an imperfect predictor. But any better-than-chance prediction is enough to elicit this phenomenon.
This array of decision theories is not a linear order, but a lattice.
And I can in some way guarantee to you that I haven’t chosen the digit adversarially (knowing in advance it’s Even), for example by asking you for a random number before you know what we’re gonna play.
There is a further complication here. Imagine you could confidently predict in which exact ways the VOI from knowing the digit will come in handy. For example, you expect the only such way is if a TV contest host asks you “What is the parity of the trillionth digit of pi?”. If that were the case, you could simply commit to, in that situation, computing and using the digit of pi, and then immediately forget it (preserving your strategicness for all other situations). Similarly, if you knew in advance in which exact ways the strategicness would serve you (for example, only if someone offers you this exact Counterfactual Mugging), you could do the opposite: commit to accepting the bet in that case, and still learn the digit and exploit the VOI for all other situations. The problem, of course, is we are usually also very unaware of which exact situations we might face, how different pieces of information help in them, and even which piece of information we have the possibility of learning (and its ramifications for our conceptual awareness).
It’s unclear how common such priors are (and them being less common would ameliorate the problem). We do have some reasons to expect most strategic deliberations to be safe, and probably other intelligences will also observe them (and so not feel the urge to commit without thinking). But the worry is also that an agent might notice they could be in a commitment race early on, when their non-updated prior still has arbitrarily wacky opinions on some important topics.
See DiGiovanni, Carlsmith, Stastny and Dai for more.
Which we don’t know how to do, and I even believe is problematically subjective.
See here for an introductory explanation, and here for more advanced considerations.
This would seem slightly surprising from my perspective, because our hand has seemed “forced” in many places, meaning phenomena like this trade-off seem fundamental to reality (however we formally represent it), rather than a fabrication of our formalisms. But of course we might be missing some radically alien possibilities.
In more detail, while reducing updateless behavior to “EDT + anthropic uncertainty (simulations)” is useful (like here), the same uncertainty about how we affect other computations remains (“if I take this action, how does this affect the coarse-grained version of me that the other player is simulating, and how does the other player respond to that?”), and the same strategic reasons not to learn more about them remain.