Ivan Vendrov comments on Any further work on AI Safety Success Stories?

Ivan Vendrov 3 Oct 2022 21:35 UTC
1 point
0
I would be interested in a detailed analysis of pivotal act vs gradual steering; my intuition is that many of the differences dissolve once you try to calculate the value of specific actions. Some unstructured thoughts below:
1. Both aim to eventually end up in a state of existential security, where nobody can ever build an unaligned AI that destroys the world. Both have to deal with the fact that power is currently broadly distributed in the world, so most plausible stories in which we end up with existential security will involve the actions of thousands if not millions of people, distributed over decades or even centuries.
2. Pivotal acts have stronger claims of impact, but generally have weaker claims about the sign of that impact: actually realistic pivotal-seeming acts like “unilaterally deploy a friendly-seeming AI singleton” or “institute a stable global totalitarianism” are extremely, existentially dangerous. If someone identifies a pivotal-seeming act that is actually robustly positive, I’ll be the first to sign on.
3. In contrast, gradual steering proposals like “improve AI lab communication” or “improve interpretability” have weaker claims to impact, but stronger claims to being net positive across many possible worlds, and are much less subject to multi-agent problems like races and the unilateralist’s curse.
4. True, complete existential safety probably requires some measure of “solving politics” and locking in current human values, hence may not be desirable. Like what if the Long Reflection decides that the negative utilitarians are right and the world should in fact be destroyed? I won’t put high credence on that, but there is some level of accidental existential risk that we should be willing to accept in order to not lock in our values.
- Krieger 4 Oct 2022 1:23 UTC
  1 point
  0
  Is it even possible for a non-pivotal act to ever achieve existential security? Even if we maxed out AI lab communication and had awesome interpretability, that doesn’t help in the long run, given that the minimum resources required to build a misaligned AGI will probably keep dropping.
  - Ivan Vendrov 4 Oct 2022 2:03 UTC
    1 point
    0
    Depends on offense-defense balance, I guess. E.g. if well-intentioned and well-coordinated actors are controlling 90% of AI-relevant compute then it seems plausible that they could defend against 10% of the compute being controlled by misaligned AGI or other bad actors—by denying them resources, by hardening core infrastructure, via MAD, etc.
    - Krieger 4 Oct 2022 7:31 UTC
      1 point
      0
      It seems like the exact model which the AI will adopt is kinda confounding my picture when I’m trying to imagine what an “existentially secure” world looks like. I’m currently thinking there are two possible existentially secure worlds:
      The obvious one is where all human dependence is removed from setting/modifying the AI’s value system (like CEV, fully value-aligned)—this would look much more unipolar.
      The alternative is for the well-intentioned-and-coordinated group to use a corrigible AI that is aligned with its human instructor. To me, whether this scenario looks existentially secure probably depends on “whether small differences in capability can magnify to great power differences”. If false, it would be much easier for capable groups to defect and make their own corrigible AI push agendas that may not be in humanity’s interest (hence not so existentially secure). If true, then the world would again be more unipolar, and its existential security would depend on how value-aligned the humans operating the corrigible AI are (I’m guessing this is your offense-defense balance example?).
      So it seems to me that the ideal end game is for humanity to end up with a value-aligned AI, either by starting with it or by somehow getting through the “dangerous period” of multipolar corrigible AIs and transitioning to a value-aligned one. Possible pathways (non-exhaustive).
      I’m not sure whether this is a good framing at all (it probably isn’t), but simply counting the number of dependencies (without taking into consideration how plausible each dependency is), it seems to me that humanity’s chances would be better in a unipolar takeover scenario: either using a value-aligned AI from the start or transitioning into one after a pivotal act.