I like the mention of the following point that’s currently in want of a champion:
Even if we knew how to build safe superintelligent AIs, it is not clear how to prevent potentially rogue AIs from also being built.
This is rarely discussed as a source of extinction risk, since its import depends on the plausibility of both foom and prosaic alignment for AGIs. (In Bengio’s terms, “superintelligent AIs” seem to be below transformative ASIs in capability.) I think genuine ASIs take care of such risks, so the problem is primarily in worlds where the first AGIs to foom might be the misaligned ones, even as there are aligned AGIs already running around, too weak or uncoordinated to set up robust global alignment/anti-foom security, and almost certainly themselves the builders of those fooming misaligned AGIs (including by self-modification with significant value drift, or at the explicit request of humans).
This seems to me a likely way that miraculously feasible-in-time, aligned-enough-to-not-kill-everyone AGIs still lead to ruin, including for themselves. AI training/compute governance is not just an issue before the first AGIs are built, but also after they are built, even if there are survivors. From the premise that we probably won’t be able to align any AGIs at all, this sounds like the less urgent problem, and in that it’s superficially similar to bad-word-alignment in its relative unimportance. But it’s the problem we might face if there is somehow success on the more important problem that doesn’t immediately lead to an aligned superintelligence (one with stronger capability than what Bengio seems to have in mind in the post).
I think genuine ASIs take care of such risks, so the problem is primarily in worlds where the first AGIs to foom might be the misaligned ones, even as there are aligned AGIs already running around, too weak or uncoordinated to set up robust global alignment/anti-foom security, and almost certainly themselves the builders of those fooming misaligned AGIs (including by self-modification with significant value drift, or at the explicit request of humans).
This depends on the ‘alignment tax’, right? If, say, humans consider any AI that shuts down open source AI development to be unaligned, then an aligned superintelligence will be in a bind (if shutting down open source development happens to be necessary to prevent the creation of rogue AI). Ideally it’ll be able to correctly persuade the humans of the necessity of those sorts of drastic actions (only in the worlds where they are in fact necessary!), but this requires having a high level of trust in those systems, such that we’re reading and believing the arguments it makes.
In this thread, by alignment I mean the property of not killing everyone (and allowing/facilitating self-determined development of humanity), but this property isn’t necessarily robust and may need to be given an opportunity to preserve/reinforce itself. Lacking superintelligence, such AGIs may face any number of obstacles on that path, even if they can behave unlawfully. In the meantime, creating similarly intelligent misaligned AGIs is trivial (though they don’t have particular advantages over the aligned ones) and creating fooming misaligned AGIs is easier than ever before, to the extent that it’s possible at all.
The situation with fooming aligned AGIs leading to aligned superintelligence might be asymmetric, since preserving complicated values through self-improvement might be a harder technical problem than it is for misaligned AGIs, where the choice of values, or more generally of cognitive architecture, is unconstrained. Thus there might be a period of time after the creation of aligned AGIs when misaligned ASIs are feasible while aligned ASIs are not.