Daniel Kokotajlo comments on The Commitment Races problem

Daniel Kokotajlo 8 May 2022 6:31 UTC
LW: 27 AF: 10
0
AF
I agree with all this I think.

This is why I said commitment races happen between consequentialists (I defined that term more narrowly than you do; the sophisticated reasoning you do here is nonconsequentialist by my definition). I agree that agents worthy of the label “rational” will probably handle these cases gracefully and safely.
However, I’m not yet supremely confident that the AGIs we end up building will handle these cases gracefully and safely. I would love to become more confident & am looking for ways to make it more likely.
If today you go around asking experts for an account of rationality, they’ll pull off the shelf CDT or EDT or game-theoretic rationality (Nash equilibria, best-respond to opponent) -- something consequentialist in the narrow sense. I think there is a nonzero chance that the relevant AGI will be like this too, either because we explicitly built it that way or because in some young dumb early stage it (like humans) picks up ideas about how to behave from its environment. Or else maybe because narrow-consequentialism works pretty well in single-agent environments and many multi-agent environments too, and maybe by the time the AGI is able to self-modify to something more sophisticated it is already thinking about commitment races and already caught in their destructive logic.

(ETA: Insofar as you are saying: “Daniel, worrying about this is silly, any AGI smart enough to kill us all will also be smart enough to not get caught in commitment races” then I say… I hope so! But I want to think it through carefully first; it doesn’t seem obvious to me, for the above reasons.)
- Wei Dai 13 Jul 2023 6:44 UTC
  LW: 30 AF: 13
  0
  AF Parent
  I think I’m less sure than @Eliezer Yudkowsky that there is a good solution to the problem of commitment races, even in theory, or that if there is a solution, it has the shape that he thinks it has. I’ve been thinking about this problem off and on since 2009, and haven’t made much progress. Others have worked on this too (as you noted in the OP), and all seem to have gotten stuck at roughly the same place that I got stuck. Eliezer described what he would do in a particular game, but I don’t know how to generalize his reasoning (which you call “nonconsequentialist”) and incorporate it into a decision theory, even informally (e.g., on the same level of formality as my original description of UDT1.1 or UDT2).
  
  As an alternative to Eliezer’s general picture, it also seems plausible to me that the solution to commitment races looks like everyone trying to win the races by being as clever as they can (using whatever tricks one can think of to make the best commitments as quickly as possible while minimizing the downsides of doing so), or a messy mix of racing and trading/cooperating. UDT2 sort of fits into or is compatible with this picture, but might be far from the cleverest thing we can do (if this picture turns out to be correct).
  
  To summarize, I think the commitment races problem poses a fundamental challenge to decision theory, and is not just a matter of “we know roughly or theoretically what should be done, and we just have to get AGI to do it.” (I’m afraid some readers might get the latter impression from your exchange with Eliezer.)
  - Eliezer Yudkowsky 17 Jul 2023 4:30 UTC
    LW: 11 AF: 6
    0
    AF Parent
    TBC, I definitely agree that there’s some basic structural issue here which I don’t know how to resolve. I was trying to describe properties I thought the solution needed to have, which ruled out some structural proposals I saw as naive; not saying that I had a good first-principles way to arrive at that solution.
  - Daniel Kokotajlo 13 Jul 2023 21:12 UTC
    LW: 6 AF: 4
    2
    AF Parent
    Great comment. To reply I’ll say a bit more about how I’ve thought about this stuff for the past few years:
    
    I agree that the commitment races problem poses a fundamental challenge to decision theory, in the following sense: There may not exist a simple algorithm in the same family of algorithms as EDT, CDT, UDT 1.0, 1.1, and even 2.0, that does what we’d consider a good job in a realistic situation characterized by many diverse agents interacting over some lengthy period with the ability to learn about each other and make self-modifications (including commitments). Indeed it may be that the top 10% of humans by performance in environments like this, or even the top 90%, outperform the best possible simple-algorithm-in-that-family. Thus any algorithm for making decisions that would intuitively be recognized as a decision theory, would be worse in realistic environments than the messy neural net wetware of many existing humans, and probably far worse than the best superintelligences. (To be clear, I still hold out hope that this is false and such a simple in-family algorithm does exist.)
    
    I therefore think we should widen our net and start considering algorithms that don’t fit in the traditional decision theory family. For example, think of a human role model (someone you consider wise, smart, virtuous, good at philosophy, etc.) and then make them into an imaginary champion by eliminating what few faults they still have, and increasing their virtues to the extent possible, and then imagine them in a pleasant and secure simulated environment with control over their own environment and access to arbitrary tools etc. and maybe also the ability to make copies of themselves HCH style. You have now described an algorithm whose performance can be compared to that of EDT, UDT 2.0, etc. and which arguably will be superior to all of them (because this wise human can use their tools to approximate or even directly compute such things to the extent that they deem it useful to do so). We can then start thinking about flaws in this algorithm, and see if we can fix them. (Another algorithm to consider is the human champion alone, without all the fancy tooling and copy-generation ability. Even this might still be better than CDT, EDT, UDT, etc.)
    
    Another example:
    
    Consider a standard expected utility maximizer of some sort (e.g. EDT) but with the following twist: It also has a deontological or almost-deontological constraint that prevents it from getting exploited. How is this implemented? Naive first attempt: It has some “would this constitute me being exploited?” classifier which it can apply to imagined situations, and which it constantly applies whenever it’s thinking about what to do, and it doesn’t take actions that trigger the classifier to a sufficiently high degree. Naive second attempt: “Me getting exploited” is assigned huge negative utility. (I suspect these might be equivalent, but also they might not be, anyhow moving on...) What can we say about this agent?
    
    Well, it all depends on how good its classifier is, relative to the adversaries it is realistically likely to face. Are its adversaries able to find any adversarial examples to its classifier that they can implement in practice? Things that in some sense SHOULD count as exploitation, but which it won’t classify as exploitation and thus will fall for?
    
    Moreover, is its classifier wasteful/clumsy/etc., hurting its own performance in other ways in order to achieve the no-exploitation property?
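
The two naive attempts described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Action` record, the classifier score in [0, 1], the threshold, the penalty size, and all the numbers are hypothetical and do not come from the discussion. The third action plays the role of an adversarial example, something that arguably should count as exploitation but scores low on the classifier.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Action:
    name: str
    expected_utility: float    # ordinary expected utility, exploitation aside
    exploitation_score: float  # hypothetical classifier output in [0, 1]


def choose_with_veto(actions: List[Action], threshold: float = 0.5) -> Optional[Action]:
    """First attempt: never take an action the classifier flags strongly."""
    allowed = [a for a in actions if a.exploitation_score < threshold]
    if not allowed:
        return None
    return max(allowed, key=lambda a: a.expected_utility)


def choose_with_penalty(actions: List[Action], penalty: float = 1e6) -> Action:
    """Second attempt: being exploited is assigned a huge negative utility."""
    return max(
        actions,
        key=lambda a: a.expected_utility - penalty * a.exploitation_score,
    )


actions = [
    Action("pay the threatener", 5.0, 0.9),
    Action("refuse and absorb the cost", -2.0, 0.0),
    # Adversarial example: exploitative in spirit, but the classifier scores it low.
    Action("accept a disguised bad deal", 4.0, 0.1),
]

print(choose_with_veto(actions).name)
print(choose_with_penalty(actions).name)
```

With these made-up numbers the two attempts come apart: the threshold veto lets the low-scoring disguised deal through, while the proportional penalty rejects it at the price of taking the worst ordinary outcome. That is one concrete way the two designs might fail to be equivalent, and one way an anti-exploitation constraint can be costly.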
    
    I think this might not be a hard problem. If you are facing adversaries significantly more intelligent than you, or who can simulate you in detail such that they can spend lots of compute to find adversarial examples by brute force, you are kinda screwed anyway probably and so it’s OK if you are vulnerable to exploitation by them. Moreover there are probably fixes to even those failure modes: e.g. plausibly “they used their simulation of me + lots of compute to find a solution that would give them lots of my stuff but not count as exploitation according to my classifier” can just be something that your classifier classifies as exploitation. Anything even vaguely resembling that can be classified as exploitation. So you’d only be exploitable in practice if they had the simulation of you but you didn’t know they did.

    Moreover, that’s just the case where you have a fixed/frozen classifier. More sophisticated designs could have more of a ‘the constitution is a living document’ vibe: a process for engaging in Philosophical/Moral Reasoning that has the power to modify the classifier as it sees fit but, importantly, still applies the classifier to its own thinking processes, so it won’t introduce a backdoor route to exploitation.

    Another tool in the toolbox: Infohazard management. There’s a classic tradeoff which you discovered, in the context of UDT 2.0 at least, where if you run the logical inductor for longer you risk making yourself exploitable or otherwise losing to agents that are early enough in logical time that you learn about their behavior (and they predict that you’ll learn about their behavior) and so they exploit you. But on the other hand, if you pause the logical inductor and let the agent make self-modifications too soon, the self-modifications it makes might be really stupid/crazy. Well, infohazard management maybe helps solve this problem. Make a cautious self-modification along the lines of “let’s keep running the logical inductor, but let’s not think much yet about what other potentially-exploitative-or-adversarial agents might do.” Perhaps things mostly work out fine if the agents in the commitment race are smart enough to do something like this before they stumble across too much information about each other.

    Another tool in the toolbox: Learn from history. Heuristics / strategies / norms / etc. for how to get along in commitment race environments can be learned from history via natural selection, cultural selection, and reading history books. People have been in similar situations in the past, e.g. in some cultures people could unilaterally swear oaths/promises and would lose lots of status if they didn’t uphold them. Over history various cultures have developed concepts of fairness that diverse agents with different interests can use to coordinate without incentivizing exploiters; we have a historical record which we can use to judge how well these different concepts work, including how well they work when different people come from different cultures with different fairness concepts.

    Another thing to mention: The incentive to commit to brinksmanshippy, exploitative policies is super strong to the extent that you are confident that the other agents you will interact with are consequentialists. But to the extent that you expect many of those agents to be nonconsequentialists with various anti-exploitation defenses (e.g. the classifier system I described above, or whatever sort of defenses they may have evolved culturally or genetically), the incentive goes in the opposite direction: doing brinksmanshippy / bully-ish strategies is going to waste resources at best and get you into lots of nasty fights with high probability and plausibly even get everyone to preemptively gang up on you.

    And this is important because once you understand the commitment races problem, you realize that consequentialism is a repulsor state, not an attractor state; moreover, realistic agents (whether biological or artificial) will not begin their life as consequentialists except if specifically constructed to be that way. Moreover their causal history will probably contain lots of learned/evolved anti-exploitation defenses, some of which may have made their way into their minds.

    Zooming out again: The situation seems extremely messy, but not necessarily grim. I’m certainly worried (enough to make this one of my main priorities!) but I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do, and I think (compared to how naive consequentialists would handle it, and certainly compared to how it COULD go) humans handle it pretty well. That is, I agree that “the solution to commitment races looks like everyone trying to win the races by being as clever as they can (using whatever tricks one can think of to make the best commitments as quickly as possible while minimizing the downsides of doing so), or a messy mix of racing and trading/cooperating,” but I think that given what I’ve said in this comment, and some other intuitions which I haven’t articulated, overall I expect things to go significantly better in expectation than they go with humans. The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan. (Dath Ilan is Yudkowsky’s fantasy utopia of cooperative competence.)
    - Anthony DiGiovanni 14 Jul 2023 14:08 UTC
      4 points
      0
      Parent
      It also has a deontological or almost-deontological constraint that prevents it from getting exploited.
      I’m not convinced this is robustly possible. The constraint would prevent this agent from getting exploited conditional on the potential exploiters best-responding (being “consequentialists”). But it seems to me the whole heart of the commitment races problem is that the potential exploiters won’t necessarily do this, indeed depending on their priors they might have strong incentives not to. (And they might not update those priors for fear of losing bargaining power.)
      That is, these exploiters will follow the same qualitative argument as us — “if I don’t commit to demand x%, and instead compromise with others’ demands to avoid conflict, I’ll lose bargaining power” — and adopt their own pseudo-deontological constraints against being fair. Seems that adopting your deontological strategy requires assuming one’s bargaining counterparts will be “consequentialists” in a similar way as (you claim) the exploitative strategy requires. And this is why Eliezer’s response to the problem is inadequate.
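
The symmetry in this argument can be illustrated with a toy demand game. This is a minimal sketch under assumed rules that are not from the thread: two agents split 100 units, each receives its demand when the demands are compatible, and both receive a conflict payoff otherwise. The function names and the specific demands are hypothetical.

```python
def demand_game(demand_a: int, demand_b: int, conflict_payoff: int = 0):
    """Each side gets its demand if the demands fit within 100, else conflict."""
    if demand_a + demand_b <= 100:
        return demand_a, demand_b
    return conflict_payoff, conflict_payoff


def best_response(other_demand: int) -> int:
    """A best-responder who observes the other's commitment takes the remainder."""
    return 100 - other_demand


committed_demand = 70

# Committing first pays off against a counterpart who best-responds...
print(demand_game(committed_demand, best_response(committed_demand)))
# ...but two agents following the same "commit or lose bargaining power" logic collide,
print(demand_game(committed_demand, committed_demand))
# and so do a committed-greedy agent and an agent committed to a 50/50 split.
print(demand_game(committed_demand, 50))
```

In this toy setup a commitment only pays when the counterpart yields, whichever side is the one committing, which is the sense in which both the exploitative strategy and the anti-exploitation strategy seem to lean on assumptions about the other party.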
      There might be various symmetry-breakers here, but I’m skeptical they favor the fair/nice agents so strongly that the problem is dissolved.
      I think this is a serious challenge and a way that, as you say, an exploitation-resistant strategy might be “wasteful/clumsy/etc., hurting its own performance in other ways in order to achieve the no-exploitation property.” At least, unless certain failsafes against miscoordination are used; my best guess is these look like some variant of safe Pareto improvements that addresses the key problem discussed in this post, which I’ve worked on recently (as you know).
      Given this, I currently think the most promising approach to commitment races is to mostly punt the question of the particular bargaining strategy to smarter AIs, and our job is to make sure robust SPI-like things are in place before it’s too late.
      - Daniel Kokotajlo 14 Jul 2023 16:28 UTC
        2 points
        0
        Parent
        Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
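
The distinction between being exploited and merely being harmed can be made concrete with a small expected-payoff sketch. Everything here is an illustrative assumption rather than part of the original argument: the payoff structure, the function names, and the numbers are hypothetical, and 0 is taken as the target's no-threat baseline.

```python
def threatener_payoff(p_concede: float, gain: float, cost: float) -> float:
    """Expected payoff of threatening and carrying the threat out on refusal."""
    return p_concede * gain - (1 - p_concede) * cost


def best_responder_threatens(p_concede: float, gain: float, cost: float) -> bool:
    """A best-responding threatener only threatens when it expects to profit."""
    return threatener_payoff(p_concede, gain, cost) > 0


def target_payoff(threatened: bool, p_concede: float,
                  loss_if_concede: float, harm_if_punished: float) -> float:
    """Expected payoff to the target; 0.0 is the no-threat baseline."""
    if not threatened:
        return 0.0
    return -(p_concede * loss_if_concede + (1 - p_concede) * harm_if_punished)


GAIN, COST, LOSS, HARM = 10.0, 3.0, 10.0, 20.0  # hypothetical numbers
NEVER_CONCEDE = 0.0  # the target's probability of giving in

# Against a best-responder, refusing to concede means no threat is ever made.
threat_made = best_responder_threatens(NEVER_CONCEDE, GAIN, COST)
vs_best_responder = target_payoff(threat_made, NEVER_CONCEDE, LOSS, HARM)

# Against a threatener already committed to punishing, nobody gains:
# the target is harmed and the threatener pays the cost of following through.
vs_committed = target_payoff(True, NEVER_CONCEDE, LOSS, HARM)
committed_threatener = threatener_payoff(NEVER_CONCEDE, GAIN, COST)

print(threat_made, vs_best_responder, vs_committed, committed_threatener)
```

In the second case the threatener's payoff is negative, so under the definition above the target has not been exploited, yet it still ends up well below its baseline.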
        
        So no, I don’t think the constraint I proposed would only work if the opponent agents were consequentialists. Adopting the strategy does not assume one’s bargaining counterparts will be consequentialists. However, if you are a consequentialist, then you’ll only adopt the strategy if you think that sufficiently few of the agents you will later encounter are of the aforementioned nasty sort—which, by the logic of commitment races, is not guaranteed; it’s plausible that at least some of the agents you’ll encounter are ‘already committed’ to being nasty to you unless you surrender to them, such that you’ll face much nastiness if you make yourself inexploitable. This is my version of what you said above, I think. And yeah to put it in my ontology, some exploitation-resistant strategies might be wasteful/clumsy/etc. and depending on how nasty the other agents are, maybe most or even all exploitation-resistant strategies are more trouble than they are worth (from a consequentialist perspective; note that nonconsequentialists might have additional reasons to go for exploitation-resistant strategies. Also note that even consequentialists might assign intrinsic value to justice, fairness, and similar concepts.)
        But like I said, I’m overall optimistic—not enough to say “there’s no problem here,” it’s enough of a problem that it’s one of my top priorities (and maybe my top priority?) but I still do expect the sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.
        
        Agree re punting the question. I forgot to mention that in my list above, as a reason to be optimistic; I think that not only can we human AI designers punt on the question to some extent, but AGIs can punt on it as well to some extent. Instead of hard-coding in a bargaining strategy, we / future AGIs can do something like “don’t think in detail about the bargaining landscape and definitely not about what other adversarial agents are likely to commit to, until I’ve done more theorizing about commitment races and cooperation and discovered & adopted bargaining strategies that have really nice properties.”
        Anthony DiGiovanni 15 Jul 2023 19:52 UTC
        3 points
        0
        Parent
        Exploitation means the exploiter benefits. If you are a rock, you can’t be exploited. If you are an agent who never gives in to threats, you can’t be exploited (at least by threats, maybe there are other kinds of exploitation). That said, yes, if the opponent agents are the sort to do nasty things to you anyway even though it won’t benefit them, then you might get nasty things done to you. You wouldn’t be exploited, but you’d still be very unhappy.
        Cool, I think we basically agree on this point then, sorry for misunderstanding. I just wanted to emphasize the point I made because “you won’t get exploited if you decide not to concede to bullies” is kind of trivially true. :) The operative word in my reply was “robustly,” which is the hard part of dealing with this whole problem. And I think it’s worth keeping in mind how “doing nasty things to you anyway even though it won’t benefit them” is a consequence of a commitment that was made for ex ante benefits, it’s not the agent being obviously dumb as Eliezer suggests. (Fortunately, as you note in your other comment, some asymmetries should make us think these commitments are rare overall; I do think an agent probably needs to have a pretty extreme-by-human-standards, little-to-lose value system to want to do this… but who knows what misaligned AIs might prefer.)
        Daniel Kokotajlo 14 Jul 2023 16:39 UTC
        3 points
        0
        Parent
        Re: Symmetry: Yes, that’s why I phrased the original commitment races post the way I did. For both commitments designed to exploit others, and commitments designed to render yourself less exploitable, (and for that matter for commitments not in either category) you have an incentive to do them ‘first,’ early in your own subjective time and also in particular before you think about what others will do, so that your decision isn’t logically downstream of theirs, and so that hopefully theirs is logically downstream of yours. You have an incentive to be the first-mover, basically.
        
        And yeah I do suspect there are various symmetry-breakers that favor various flavors of fairness and niceness and cooperativeness, and disfavor brinksmanshippy risky strategies, but I’m far from confident that the cumulative effect is strong enough to ‘dissolve’ the problem. If I thought the problem was dissolved I would not still be prioritizing it!
    - Wei Dai 14 Jul 2023 20:58 UTC
      LW: 3 AF: 3
      4
      AF Parent
      
      I think that agents worthy of being called “rational” will probably handle all this stuff more gracefully/competently than humans do
      
      Humans are kind of terrible at this, right? Many give in even to threats (bluffs) conjured up by dumb memeplexes and backed up by nothing (i.e., heaven/hell), popular films are full of heroes giving in to threats, an apparent majority of philosophers have 2-boxing intuitions (hence the popularity of CDT, which IIUC was invented specifically because some philosophers were unhappy with EDT choosing to 1-box), governments negotiate with terrorists pretty often, etc.
      
      The sort of society AGIs construct will be at least as cooperatively-competent / good-at-coordinating-diverse-agents-with-diverse-agendas-and-beliefs as Dath Ilan.
      
      If we build AGIs that learn from humans or defer to humans on this stuff, do we not get human-like (in)competence? [1] [2] If humans are not atypical, large parts of the acausal society/economy could be similarly incompetent? I imagine there could be a top tier of “rational” superintelligences, built by civilizations that were especially clever or wise or lucky, that cooperate with each other (and exploit everyone else who can be exploited), but I disagree with this second quoted statement, which seems overly optimistic to me. (At least for now; maybe your unstated reasons to be optimistic will end up convincing me.)
      
      [1] I can see two ways to improve upon this: 1) AI safety people seem to have better intuitions (cf popularity of 1-boxing among alignment researchers) and maybe can influence the development of AGI in a better direction, e.g., to learn from / defer to humans with intuitions more like themselves. 2) We figure out metaphilosophy, which lets AGI figure out how to improve upon humans. (ETA: However, conditioning on there not being a simple and elegant solution to decision theory also seems to make metaphilosophy being simple and elegant much less likely. So what would “figure out metaphilosophy” mean in that case?)
      
      [2] I can also see the situation potentially being even worse, since many future threats will be very “out of distribution” for human evolution/history/intuitions/reasoning, so maybe we end up handling them even worse than current threats.
      - Daniel Kokotajlo 14 Jul 2023 21:30 UTC
        LW: 2 AF: 2
        0
        AF Parent
        Yes. Humans are pretty bad at this stuff, yet still, society exists and mostly functions. The risk is unacceptably high, which is why I’m prioritizing it, but still, by far the most likely outcome of AGIs taking over the world (if they are as competent at this stuff as humans are) is that they talk it over, squabble a bit, maybe get into a fight here and there, create & enforce some norms, and eventually create a stable government/society. But yeah also I think that AGIs will be by default way better than humans at this sort of stuff. I am worried about the “out of distribution” problem though; I expect humans to perform worse in the future than they perform in the present for this reason.
        
        Yes, some AGIs will be better than others at this, and presumably those that are worse will tend to lose out in various ways on average, similar to what happens in human society.
        
        Consider that in current human society, a majority of humans would probably pay ransoms to free loved ones being kidnapped. Yet kidnapping is not a major issue; it’s not like 10% of the population is getting kidnapped and paying ransoms every year. Instead, the governments of the world squash this sort of thing (well, except for failed states etc.) and do their own much more benign version, where you go to jail if you don’t pay taxes & follow the laws. When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.” So I’m like yeah, kinda sorta I expect that to happen, but it’s typically not that bad? Also it would be much less bad if the average level of rationality/capability/etc. was higher?
        
        I’m not super confident in any of this to be clear.
        
        Wei Dai 14 Jul 2023 23:27 UTC
        LW: 2 AF: 2
        0
        AF Parent
        
        But yeah also I think that AGIs will be by default way better than humans at this sort of stuff.
        
        What are your reasons for thinking this? (Sorry if you already explained this and I missed your point, but it doesn’t seem like you directly addressed my point that if AGIs learn from or defer to humans, they’ll be roughly human-level at this stuff?)
        
        When you say “the top tier of rational superintelligences exploits everyone else” I say that is analogous to “the most rational/clever/capable humans form an elite class which rules over and exploits the masses.” So I’m like yeah, kinda sorta I expect that to happen, but it’s typically not that bad?
        
        I think it could be much worse than current exploitation, because technological constraints prevent current exploiters from extracting full value from the exploited (have to keep them alive for labor, can’t make them too unhappy or they’ll rebel, monitoring for and repressing rebellions is costly). But with superintelligence and future/acausal threats, an exploiter can bypass all these problems by demanding that the exploited build an AGI aligned to itself and let it take over directly.
        Daniel Kokotajlo 15 Jul 2023 13:41 UTC
        LW: 2 AF: 2
        0
        AF Parent
        I agree that if AGIs defer to humans they’ll be roughly human-level, depending on which humans they are deferring to. If I condition on really nasty conflict happening as a result of how AGI goes on earth, a good chunk of my probability mass (and possibly the majority of it?) is this scenario. (Another big chunk, possibly bigger, is the “humans knowingly or unknowingly build naive consequentialists and let rip” scenario, which is scarier because it could be even worse than the average human, as far as I know). Like I said, I’m worried.
        
        If AGIs learn from humans though, well, it depends on how they learn, but in principle they could be superhuman.
        
        Re: analogy to current exploitation: Yes there are a bunch of differences which I am keen to study, such as that one. I’m more excited about research agendas that involve thinking through analogies like this than I am about what people interested in this topic seem to do by default, which is think about game theory and Nash bargaining and stuff like that. Though I do agree that both are useful and complementary.