> I think that the biggest difference between us is that I think that working on single-single alignment is the easiest way to make headway on that issue, whereas you expect greater improvements from some categories of technical work on coordination
Yes.
> (my sense is that I’m quite skeptical about most of the particular kinds of work you advocate
That is also my sense, and a major reason I suspect multi/multi delegation dynamics will remain neglected among x-risk oriented researchers for the next 3-5 years at least.
> If you disagree, then I expect the main disagreement is about those other sources of overhead
Yes, I think coordination costs will by default pose a high overhead cost to preserving human values among systems with the potential to race to the bottom on how much they preserve human values.
> I think I disagree with you on the tininess of the advantage conferred by ignoring human values early on during a multi-polar take-off. I agree the long-run cost of supporting humans is tiny, but I’m trying to highlight a dynamic where fairly myopic/nihilistic power-maximizing entities end up quickly out-competing entities with other values, due to, as you say, bargaining failure on the part of the creators of the power-maximizing entities.
Could you explain the advantage you are imagining?
Yes. Imagine two competing cultures A and B have transformative AI tech. Both are aiming to preserve human values, but within A, a subculture A’ develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values. The shift is by design subtle enough not to trigger leaders of A and B to have a bargaining meeting to regulate against A’ (contrary to Carl’s narrative where leaders coordinate against loss of control). Subculture A’ comes to dominate discourse and cultural narratives in A, and makes A faster/more productive than B, such as through the development of fully automated companies as in one of the Production Web stories. The resulting advantage of A is enough for A to begin dominating or at least threatening B geopolitically, but by that time leaders in A have little power to squash A’, so instead B follows suit by allowing a highly automation-oriented subculture B’ to develop. These advantages are small enough not to trigger regulatory oversight, but when integrated over time they are not “tiny”. This results in the gradual empowerment of humans who are misaligned with preserving human existence, until those humans also lose control of their own existence, perhaps willfully, or perhaps carelessly, or through a mix of both.
Here, the members of subculture A’ are misaligned with preserving the existence of humanity, but their tech is aligned with them.
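To make the “not tiny when integrated over time” point concrete, here is a toy calculation; the numbers are placeholders for illustration, not predictions:

```python
# Toy illustration (made-up numbers): a per-period productivity edge that looks
# negligible at any single checkpoint can integrate into a decisive gap.
edge_per_quarter = 0.02   # suppose A is 2% more productive per quarter than B
quarters = 80             # roughly 20 years

relative_advantage = (1 + edge_per_quarter) ** quarters
print(f"A's output relative to B after ~20 years: {relative_advantage:.1f}x")
# -> roughly 4.9x, even though no single quarter's gap looked alarming
```

The point is just that “small enough to escape regulatory attention at any given time” and “small when integrated over a takeoff period” are very different properties.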
> Both are aiming to preserve human values, but within A, a subculture A’ develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.
I was asking you why you thought A’ would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?
- One obvious reason is single-single misalignment—A’ is willing to deploy misaligned AI in order to get an advantage, while B isn’t—but you say “their tech is aligned with them” so it sounds like you’re setting this aside. But maybe you mean that A’ has values that make alignment easy, while B has values that make alignment hard, and so B’s disadvantage still comes from single-single misalignment even though the systems built by A’ are aligned?
- Another advantage is that A’ can invest almost all of their resources, while B wants to spend some of their resources today to e.g. help presently-living humans flourish. But quantitatively that advantage doesn’t seem like it can cause A’ to dominate, since B can secure rapidly rising quality of life for all humans using only a small fraction of its initial endowment (see the numerical sketch below).
- Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important. For example, if a research-producer and research-consumer have different values, then the producer may restrict access as part of an inefficient negotiation process and so they may be at a competitive disadvantage relative to a competing community where research is shared freely. This feels inconsistent with many of the things you are saying in your story, but I might be misunderstanding what you are saying and it could be that some argument like Wei Dai’s is the best way to translate your concerns into my language.
My sense is that you have something else in mind. I included the last bullet point as a representative example to describe the kind of advantage I could imagine you thinking that A’ had.
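To spell out the quantitative intuition behind the second bullet, here is a minimal sketch; the growth rate and fractions are made up purely for illustration:

```python
# Sketch (illustrative numbers): if B sets aside a small fraction of its initial
# endowment for present-day human flourishing and reinvests the rest, B trails
# A' by a fixed constant factor rather than an ever-growing one.
growth_per_period = 0.05
periods = 200
set_aside = 0.01               # fraction of B's initial endowment spent on humans

a_prime = 1.0                  # A' reinvests everything
b = 1.0 - set_aside            # B reinvests the remainder
humans = set_aside             # this pool can itself be invested for humans' benefit

for _ in range(periods):
    a_prime *= 1 + growth_per_period
    b *= 1 + growth_per_period
    humans *= 1 + growth_per_period

print(f"B's resources as a fraction of A-prime's: {b / a_prime:.2f}")  # stays at 0.99
print(f"Growth of the human-flourishing pool: {humans / set_aside:.0f}x")
```

A constant 1% handicap doesn’t look like the kind of advantage that lets A’ dominate.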
> Both [cultures A and B] are aiming to preserve human values, but within A, a subculture A’ develops to favor more efficient business practices (nihilistic power-maximizing) over preserving human values.
> I was asking you why you thought A’ would effectively outcompete B (sorry for being unclear). For example, why do people with intrinsic interest in power-maximization outcompete people who are interested in human flourishing but still invest their money to have more influence in the future?
Ah! Yes, this is really getting to the crux of things. The short answer is that I’m worried about the following failure mode:
Failure mode: When B-cultured entities invest in “having more influence”, often the easiest way to do this will be for them to invest in or copy A’-cultured-entities/processes. This increases the total presence of A’-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A’ culture has an incentive to trick the B culture(s) into thinking A’ will not take over the world, but eventually, A’ wins.
(Here, I’m using the word “culture” to encode a mix of information subsuming utility functions, beliefs, decision theory, cognitive capacities, and other features determining the general tendencies of an agent or collective.)
Of course, an easy antidote to this failure mode is to have A or B win instead of A’, because A and B both have some human values other than power-maximizing. The problem is that this whole situation is premised on a conflict between A and B over which culture should win, and then the following observation applies:
> Wei Dai has suggested that groups with unified values might outcompete groups with heterogeneous values since homogeneous values allow for better coordination, and that AI may make this phenomenon more important.
In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization). This observation is slightly different from observations that “simple values dominate engineering efforts” as seen in stories about singleton paperclip maximizers. A key feature of the Production Web dynamic is not just that it’s easy to build production maximizers, but that it’s easy to accidentally cooperate on building production-maximizing systems that destroy both you and your competitors.
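To make the game structure I have in mind explicit, here is a toy payoff matrix; the payoffs are placeholders, not a model of anything in particular:

```python
# Toy race-to-the-bottom payoffs (illustrative): each culture is unilaterally
# better off ceding more control to fast production-maximizing subprocesses,
# whatever the other culture does, yet the joint outcome is worst for both.
# Entries are (A's payoff, B's payoff).
payoffs = {
    ("retain control", "retain control"): (3, 3),
    ("retain control", "cede control"):   (0, 4),
    ("cede control",   "retain control"): (4, 0),
    ("cede control",   "cede control"):   (1, 1),
}

for b_choice in ("retain control", "cede control"):
    best_reply = max(("retain control", "cede control"),
                     key=lambda a_choice: payoffs[(a_choice, b_choice)][0])
    print(f"If B plays {b_choice!r}, A's best reply is {best_reply!r}")
# Both best replies are "cede control", so the equilibrium is (cede, cede) with
# payoffs (1, 1), even though (retain, retain) with (3, 3) was available to both.
```

The “accidental cooperation” is that each culture’s production-maximizing subprocess is doing exactly what its host culture unilaterally rewards it for doing.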
> This feels inconsistent with many of the things you are saying in your story, but
Thanks for noticing whatever you think are the inconsistencies; if you have time, I’d love for you to point them out.
> I might be misunderstanding what you are saying and it could be that some argument like Wei Dai’s is the best way to translate your concerns into my language.
This seems pretty likely to me. The bolded attribution to Dai above is a pretty important RAAP in my opinion, and it’s definitely a theme in the Production Web story as I intend it. Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures. Throughout this, each cultural subprocess is doing what its “host culture” wants it to do from a unilateral perspective (work faster / keep up with the competitor cultures), but the overall effect is destruction of the host cultures (a la Prisoner’s Dilemma) by the cultural subprocesses.
If I had to use alignment language, I’d say “the production web overall is misaligned with human culture, while each part of the web is sufficiently well-aligned with the human entit(ies) who interact with it that it is allowed to continue operating”. Too low of a bar for “allowed to continue operating” is key to the failure mode, of course, and you and I might have different predictions about what bar humanity will actually end up using at roll-out time. I would agree, though, that conditional on a given roll-out date, improving E[alignment_tech_quality] on that date is good and complementary to improving E[cooperation_tech_quality] on that date.
Did this get us any closer to agreement around the Production Web story? Or if not, would it help to focus on the aforementioned inconsistencies with homogeneous-coordination-advantage?
> Failure mode: When B-cultured entities invest in “having more influence”, often the easiest way to do this will be for them to invest in or copy A’-cultured-entities/processes. This increases the total presence of A’-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A’ culture has an incentive to trick the B culture(s) into thinking A’ will not take over the world, but eventually, A’ wins.
I’m wondering why the easiest way is to copy A’—why was A’ better at acquiring influence in the first place, so that copying them or investing in them is a dominant strategy? I think I agree that once you’re at that point, A’ has an advantage.
> In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).
This doesn’t feel like other words to me, it feels like a totally different claim.
> Thanks for noticing whatever you think are the inconsistencies; if you have time, I’d love for you to point them out.
In the production web story it sounds like the web is made out of different firms competing for profit and influence with each other, rather than a set of firms that are willing to leave profit on the table to benefit one another since they all share the value of maximizing production. For example, you talk about how selection drives this dynamic, but the firms that succeed are those that maximize their own profits and influence (not those that are willing to leave profit on the table to benefit other firms).
So none of the concrete examples of Wei Dai’s economies of scale actually seem to give an advantage to the profit-maximizers in the production web. For example, natural monopolies in the production web wouldn’t charge each other marginal costs, they would charge profit-maximizing prices. And they won’t share infrastructure investments except by solving exactly the same bargaining problem as any other agents (since a firm that indiscriminately shared its infrastructure would get outcompeted). And so on.
> Specifically, the subprocesses of each culture that are in charge of production-maximization end up cooperating really well with each other in a way that ends up collectively overwhelming the original (human) cultures.
This seems like a core claim (certainly if you are envisioning a scenario like the one Wei Dai describes), but I don’t yet understand why this happens.
Suppose that the US and China both have productive widget-industries. You seem to be saying that their widget-industries can coordinate with each other to create lots of widgets, and they will do this more effectively than the US and China can coordinate with each other.
Could you give some concrete example of how the US widget industry and the Chinese widget industries coordinate with each other to make more widgets, and why this behavior is selected?
For example, you might think that the Chinese and US widget industries share their insights into how to make widgets (as the aligned actors do in Wei Dai’s story), and that this will cause widget-making to do better than other non-widget sectors where such coordination is not possible. But I don’t see why they would do that—the US firms that share their insights freely with Chinese firms do worse, and would be selected against in every relevant sense, relative to firms that attempt to effectively monetize their insights. But effectively monetizing their insights is exactly what the US widget industry should do in order to benefit the US. So I see no reason why the widget industry would be more prone to sharing its insights.
So I don’t think that particular example works. I’m looking for an example of that form, though: some concrete form of cooperation that the production-maximization subprocesses might engage in that allows them to overwhelm the original cultures, to give some indication of why you think this will happen in general.
> Failure mode: When B-cultured entities invest in “having more influence”, often the easiest way to do this will be for them to invest in or copy A’-cultured-entities/processes. This increases the total presence of A’-like processes in the world, which have many opportunities to coordinate because of their shared (power-maximizing) values. Moreover, the A’ culture has an incentive to trick the B culture(s) into thinking A’ will not take over the world, but eventually, A’ wins.
> In other words, the humans and human-aligned institutions not collectively being good enough at cooperation/bargaining risks a slow slipping-away of hard-to-express values and an easy takeover of simple-to-express values (e.g., power-maximization).
> This doesn’t feel like other words to me, it feels like a totally different claim.
Hmm, perhaps this is indicative of a key misunderstanding.
> For example, natural monopolies in the production web wouldn’t charge each other marginal costs, they would charge profit-maximizing prices.
Why not? The third paragraph of the story indicates that “Companies closer to becoming fully automated achieve faster turnaround times, deal bandwidth, and creativity of negotiations.” In other words, at that point it could certainly happen that two monopolies would agree to charge each other lower prices if it benefited both of them. (Unless you’d count that as an instance of “charging profit-maximizing prices”?) The concern is that the subprocesses of each company/institution that get good at (or succeed at) bargaining with other institutions are subprocesses that (by virtue of being selected for speed and simplicity) are less aligned with human existence than the original overall company/institution, and that the less-aligned subprocess grows to take over the institution, while always taking actions that are “good” for the host institution when viewed as a unilateral move in an uncoordinated game (hence passing as “aligned”).
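To illustrate that this is not an exotic outcome, here is the textbook double-marginalization calculation, with my own illustrative numbers rather than anything from the story, in which two monopolists both gain by agreeing to charge each other less than the unilaterally profit-maximizing price:

```python
# Double marginalization (illustrative numbers): an upstream monopolist with
# marginal cost 2 sells an input at wholesale price w to a downstream
# monopolist facing linear demand P = 10 - Q (no other costs).

def downstream_quantity(w):
    # Downstream maximizes (10 - Q - w) * Q, giving Q = (10 - w) / 2.
    return (10 - w) / 2

def profits(w):
    q = downstream_quantity(w)
    upstream = (w - 2) * q
    downstream = (10 - q - w) * q
    return upstream, downstream

up_hi, down_hi = profits(6)   # w = 6 is upstream's unilaterally optimal price
up_lo, down_lo = profits(2)   # marginal-cost pricing deal, plus a fixed fee
fee = 10

print("monopoly wholesale price:", up_hi, down_hi)              # 8.0, 4.0  (total 12)
print("marginal-cost deal + fee:", up_lo + fee, down_lo - fee)  # 10.0, 6.0 (total 16)
```

Both parties prefer the deal, so “two monopolies charging each other less” is an ordinary outcome of competent bargaining, not a strange one.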
At this point, my plan is to try to consolidate what I think are the main confusions in the comments of this post into one or more new concepts, to form the topic of a new post.
> At this point, my plan is to try to consolidate what I think are the main confusions in the comments of this post into one or more new concepts, to form the topic of a new post.
Sounds great! I was thinking myself about setting aside some time to write a summary of this comment section (as I see it).