Responding to Matt Reardon’s point on the EA Forum:
Leopold’s implicit response as I see it:
Convincing all stakeholders of high p(doom) such that they take decisive, coordinated action is wildly improbable (“step 1: get everyone to agree with me” is the foundation of many terrible plans and almost no good ones)
Still improbable, but less wildly, is the idea that we can steer institutions towards sensitivity to risk on the margin and that those institutions can position themselves to solve the technical and other challenges ahead
Maybe the key insight is that both strategies walk on a knife’s edge. While Moore’s law, algorithmic improvement, and chip design hum along at some level, even a little breakdown in international willpower to enforce a pause/stop can rapidly convert to catastrophe. Spending a lot of effort to get that consensus also has high opportunity cost in terms of steering institutions in the world where the effort fails (and it is very likely to fail). [...]
Three high-level reasons I think Leopold’s plan looks a lot less workable:
It requires major scientific breakthroughs to occur on a very short time horizon, including unknown breakthroughs that will manifest to solve problems we don’t understand or know about today.
These breakthroughs need to come in a field that has not been particularly productive or fast in the past. (Indeed, forecasters have been surprised by how slowly safety/robustness/etc. have progressed in recent years, and simultaneously surprised by the breakneck speed of capabilities.)
It requires extremely precise and correct behavior by a giant government bureaucracy that includes many staff who won’t be the best and brightest in the field — inevitably, many technical and nontechnical people in the bureaucracy will have wrong beliefs about AGI and about alignment.
The “extremely precise and correct behavior” part means that we’re effectively hoping to be handed an excellent bureaucracy that will rapidly and competently solve a thirty-digit combination lock requiring the invention of multiple new fields and the solving of a variety of thorny and poorly-understood technical problems — in many cases, on Leopold’s view, in a space of months or weeks. This seems… not like how the real world works.
It also separately requires that various guesses about the background empirical facts all pan out. Leopold can do literally everything right and get the USG fully on board and get the USG doing literally everything correctly by his lights — and then the plan ends up destroying the world rather than saving it because it just happened to turn out that ASI was a lot more compute-efficient to train than he expected, resulting in the USG being unable to monopolize the tech and unable to achieve a sufficiently long lead time.
My proposal doesn’t require qualitatively that kind of success. It requires governments to coordinate on banning things. Plausibly, it requires governments to overreact to a weird, scary, and publicly controversial new tech to some degree, since it’s unlikely that governments will exactly hit the target we want. This is not a particularly weird ask; governments ban things (and coordinate or copy-paste each other’s laws) all the time, in far less dangerous and fraught areas than AGI. This is “trying to get the international order to lean hard in a particular direction on a yes-or-no question where there’s already a lot of energy behind choosing ‘no’”, not “solving a long list of hard science and engineering problems in a matter of months and getting a bureaucracy to output the correct long string of digits to nail down all the correct technical solutions and all the correct processes to find those solutions”.
The CCP’s current appetite for AGI seems remarkably small, and I expect them to be more worried that an AGI race would leave them in the dust (and/or put their regime at risk, and/or put their lives at risk), than excited about the opportunity such a race provides. Governments around the world currently, to the best of my knowledge, are nowhere near advancing any frontiers in ML.
From my perspective, Leopold is imagining a future problem into being (“all of this changes”) and then trying to find a galaxy-brained incredibly complex and assumption-laden way to wriggle out of this imagined future dilemma, when the far easier and less risky path would be to not have the world powers race in the first place, have them recognize that this technology is lethally dangerous (something the USG chain of command, at least, would need to fully internalize on Leopold’s plan too), and have them block private labs from sending us over the precipice (again, something Leopold assumes will happen) while not choosing to take on the risk of destroying themselves (nor permitting other world powers to unilaterally impose that risk).
I do have a lot of reservations about Leopold’s plan. But one positive feature it does have: it proposes to rely on a multitude of “limited weakly-superhuman artificial alignment researchers” and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable. So his plan does seem to have a good chance to overcome the factor that AI existential safety research is a
field that has not been particularly productive or fast in the past
and also to overcome other factors requiring overreliance on humans and on current human ability.
I do have a lot of reservations about the “prohibition plan” as well. One of those reservations is as follows. You and Leopold seem to share the assumption that huge GPU farms or equivalently strong compute are necessary for superintelligence. Surely, having huge GPU farms is the path of least resistance: those farms facilitate fast advances, and while this path is relatively open, people and orgs will mostly choose it and can be partially controlled via their reliance on that path (one can impose various compliance requirements and such).
But what would happen if one effectively closes that path? There will be huge selection pressure to look for alternative routes, to invest more heavily in those algorithmic breakthroughs which can work with modest GPU power or even with CPUs. When one thinks about this kind of prohibition, one tends to look at the relatively successful history of control over nuclear proliferation, but the reality might end up looking more like our drug war (ultimately unsuccessful, pushing many drugs outside government regulation, and resulting in both more dangerous and, in a number of cases, more potent drugs).
I am sure that a strong prohibition attempt would buy us some time, but I am not sure it would reduce the overall risk. The resulting situation does not look promising in the long run: half of AI practitioners would find themselves in opposition to the resulting “new world order” and would be looking for opportunities to circumvent the prohibition, while the mainstream imposing the prohibition would presumably not be arming itself with the next generations of stronger and stronger AI systems (if we are really talking about a full moratorium). I would expect the opposition to eventually succeed at building the prohibited systems and to use them to upend the world order it dislikes, while perhaps running a higher level of existential risk because of the lack of regulation and coordination.
I hope people will step back from solely focusing on advocating for policy-level prescriptions (as none of the existing policy-level prescriptions look particularly promising at the moment) and invest some of their time in continuing object-level discussions of AI existential safety without predefined political ends.
I don’t think we have discussed the object level of AI existential safety nearly enough. There might be overlooked approaches and overlooked ways of thinking, and if we split into groups such that each group has firmly made up its mind about its favored, presumably optimal, set of policy-level prescriptions and about the assumptions underlying those prescriptions, we are unlikely to make much progress on the object level.
It probably should be a mixture of public and private discussions (it might be easier to talk frankly in more private settings these days for a number of reasons).
one positive feature it does have: it proposes to rely on a multitude of “limited weakly-superhuman artificial alignment researchers” and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable.
I don’t find this convincing. I think the target “dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment” is narrow (or just nonexistent, using the methods we’re likely to have on hand).
Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where checking is particularly difficult.
You and Leopold seem to share the assumption that huge GPU farms or equivalently strong compute are necessary for superintelligence.
Nope! I don’t assume that.
I do think that it’s likely the first world-endangering AI is trained using more compute than was used to train GPT-4; but I’m certainly not confident of that prediction, and I don’t think it’s possible to make reasonable predictions (given our current knowledge state) about how much more compute might be needed.
(“Needed” for the first world-endangeringly powerful AI humans actually build, that is. I feel confident that you can in principle build world-endangeringly powerful AI with far less compute than was used to train GPT-4; but the first lethally powerful AI systems humans actually build will presumably be far from the limits of what’s physically possible!)
But what would happen if one effectively closes that path? There will be huge selection pressure to look for alternative routes, to invest more heavily in those algorithmic breakthroughs which can work with modest GPU power or even with CPUs.
Agreed. This is why I support humanity working on things like human enhancement and (plausibly) AI alignment, in parallel with working on an international AI development pause. I don’t think that a pause on its own is a permanent solution, though if we’re lucky and the laws are well-designed I imagine it could buy humanity quite a few decades.
I hope people will step back from solely focusing on advocating for policy-level prescriptions (as none of the existing policy-level prescriptions look particularly promising at the moment) and invest some of their time in continuing object-level discussions of AI existential safety without predefined political ends.
FWIW, MIRI does already think of “generally spreading reasonable discussion of the problem, and trying to increase the probability that someone comes up with some new promising idea for addressing x-risk” as a top organizational priority.
The usual internal framing is some version of “we have our own current best guess at how to save the world, but our idea is a massive longshot, and not the sort of basket humanity should put all its eggs in”. I think “AI pause + some form of cognitive enhancement” should be a top priority, but I also consider it a top priority for humanity to try to find other potential paths to a good future.
I don’t find this convincing. I think the target “dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment” is narrow (or just nonexistent, using the methods we’re likely to have on hand).
Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where checking is particularly difficult.
It’s also important to keep in mind that on Leopold’s model (and my own), these problems need to be solved under a ton of time pressure. To maintain a lead, the USG in Leopold’s scenario will often need to figure out some of these “under what circumstances can we trust this highly novel system and believe its alignment answers?” issues in a matter of weeks or perhaps months, so that the overall alignment project can complete in a very short window of time. This is not a situation where we’re imagining having a ton of time to develop mastery and deep understanding of these new models. (Or mastery of the alignment problem sufficient to verify when a new idea is on the right track or not.)
Leopold’s scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don’t have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech’s development and maintain the status quo at minimal risk.
Success in this scenario requires a weird combination of USG prescience with self-destructiveness: enough foresight to see what’s coming, but paired with a weird compulsion to race to build the very thing that puts its existence at risk, when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.
Interesting, do you have a link to these safety predictions? I was not aware of this.
I’m guessing Rob is referring to footnote 54 in What do XPT forecasts tell us about AI risk?:
And while capabilities have been increasing very rapidly, research into AI safety does not seem to be keeping pace, even if it has perhaps sped up in the last two years. An isolated, but illustrative, data point of this can be seen in the results of the 2022 section of a Hypermind forecasting tournament: on most benchmarks, forecasters underpredicted progress, but they overpredicted progress on the single benchmark somewhat related to AI safety.
That last link is to Jacob Steinhardt’s tweet linking to his 2022 post AI Forecasting: One Year In, on the results of their 2021 forecasting contest. Quote:
Progress on a robustness benchmark was slower than expected, and was the only benchmark to fall short of forecaster predictions. This is somewhat worrying, as it suggests that machine learning capabilities are progressing quickly, while safety properties are progressing slowly. …
That’s all I got, no other predictions.
Yep, I had in mind AI Forecasting: One Year In.
when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.
I would be interested in reading more about the methods that could be used to prohibit the proliferation of this technology (you can assume a “wake-up” from the USG).
I think one of the biggest fears would be that any sort of international alliance would not have perfect/robust detection capabilities, so you’re always facing the risk that someone might be running a rogue AGI project.
Also, separately, there’s the issue of “at some point, doesn’t it become so trivially easy to develop AGI that we still need the International Community Good Guys to develop AGI [or do something else] that gets us out of the acute risk period?” When you say “prohibit this technology”, do you mean “prohibit this technology from being developed outside of the International Community Good Guys Cluster” or do you mean “prohibit this technology in its entirety?”
Leopold’s scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don’t have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence
Alternatively, they either don’t buy the perils, or they believe there’s a chance the other side may not? I think there is an assumption being made in this statement and in a lot of the proposed strategies in this thread. If not everyone is cooperative and buys the high p(doom) arguments, then this all falls apart. Nuclear war essentially has a localized p(doom) of 1, yet both superpowers still built nuclear weapons. I am highly skeptical of any potential solution to any of this. It requires everyone (and not just, say, half) to buy the arguments to begin with.
Alternatively, they either don’t buy the perils, or they believe there’s a chance the other side may not?
If they “don’t buy the perils”, and the perils are real, then Leopold’s scenario is falsified and we shouldn’t be pushing for the USG to build ASI.
If there are no perils at all, then sure, Leopold’s scenario and mine are both false. I didn’t mean to imply that our two views are the only options.
Separately, Leopold’s model of “what are the dangers?” is different from mine. But I don’t think the dangers Leopold is worried about are dramatically easier to understand than the dangers I’m worried about (in the respective worlds where our worries are correct). Just the opposite: the level of understanding you need to literally solve alignment for superintelligences vastly exceeds the level you need to just be spooked by ASI and not want it to be built. Which is the point I was making; not “ASI is axiomatically dangerous”, but “this doesn’t count as a strike against my plan relative to Leopold’s, and in fact Leopold is making a far bigger ask of government than I am on this front”.
Nuclear war essentially has a localized p(doom) of 1
I don’t know what this means. If you’re saying “nuclear weapons kill the people they hit”, I don’t see the relevance; guns also kill the people they hit, but that doesn’t make a gun strategically similar to a smarter-than-human AI system.
I don’t know what this means. If you’re saying “nuclear weapons kill the people they hit”, I don’t see the relevance; guns also kill the people they hit, but that doesn’t make a gun strategically similar to a smarter-than-human AI system.
It is well known that nuclear weapons result in MAD, or localized annihilation. They were still built. But my more important point is that this sort of thinking requires most actors to be convinced there is a high p(doom) and, more importantly, also convinced that the other side believes there is a high p(doom). If either of those is false, then not building doesn’t work. If the other side is building it, then you have to build it anyway, just in case your theoretical p(doom) arguments are wrong. Again, this is just arguing your way around a pretty basic prisoner’s dilemma.
And consider the fact that we will develop AGIs (note: not ASI) anyway, and alignment (or at least control) will almost certainly work for them.[1] The prisoner’s dilemma indicates you have to match the drone warfare capabilities of the other side regardless of p(doom).
In the world where the USG understands there are risks but sees them as something with decent odds of being solvable, we build it anyway. If the other side builds it and you do not, the gameboard is a 20% chance of dying and an 80% chance of handing the light cone to your enemy. I think this is the most probable option, making all Pause efforts doomed. High p(doom) folks can’t even convince low p(doom) folks on LessWrong, the subset of optimists most likely to be receptive to their arguments, that they are wrong. There is no chance you won’t simply be a faction in the USG like environmentalists are.
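To make that framing concrete, here is a rough, purely illustrative sketch of the expected-value comparison being gestured at. None of this is anyone’s actual model: the 0.2 doom probability is taken from the 20% figure above, and the payoff numbers (doom = -100, rival takes the light cone = -80, we take it = +100, nobody builds = 0) are arbitrary placeholders.

```python
# Illustrative sketch of the "race vs. refrain" reasoning above.
# All numbers are assumptions for illustration only.

p_doom = 0.2          # assumed chance that building ASI ends in catastrophe
rival_builds = True   # the commenter's premise: the other side races anyway

def expected_value(we_build: bool, rival_builds: bool) -> float:
    if we_build:
        # Doom hits with probability p_doom; otherwise we capture the upside.
        return p_doom * (-100) + (1 - p_doom) * 100
    if rival_builds:
        # We refrain but the rival builds: doom is still possible, and if
        # alignment "just works" the rival gets the light cone.
        return p_doom * (-100) + (1 - p_doom) * (-80)
    return 0.0  # neither side builds

print(expected_value(True, rival_builds))   # racing:     0.2*-100 + 0.8*100  =  60.0
print(expected_value(False, rival_builds))  # refraining: 0.2*-100 + 0.8*-80  = -84.0
```

Under these assumed numbers, building beats refraining whenever you expect the other side to build, which is the prisoner’s-dilemma structure described above; different payoffs or a much higher p(doom) can flip the comparison.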
But let’s pretend for a moment that the USG buys the high-risk doomer argument for superintelligence. The USG and CCP are both rushing to build AGIs regardless, since AGI can be controlled, and not having a drone swarm means you lose military relevance. Because of how fuzzy the line between ASI and AGI will be in this world, I think it’s very plausible that enough people will be convinced the CCP doesn’t believe alignment is too hard and will build ASI anyway.
Even people with high p(doom)s might have a nagging part of their mind asking: what if alignment just works? If alignment just works (again, this is impossible to disprove; if we could prove or disprove it, we wouldn’t need to consider pausing to begin with, it would be self-evident), then great, you just handed your entire nation’s future to the enemy.
We have some time to solve alignment, but a long-term pause will be downright impossible. What we need to do is tackle the technical problem as soon as possible instead of trying to pause. The race conditions are set; the prisoner’s dilemma is locked in.
[1] I think they will certainly work. We have a long history of controlling humans and forcing them to do things that they don’t want to do. Practically every argument about p(doom) relies on the AI being smarter than us. If it’s not, then it’s just an insanely useful tool. All the solutions that sound “dumb” with ASI, like having an off switch, air gapping, etc., work with weak-enough but still useful systems.