I think genuine ASIs take care of such risks, so the problem lies primarily in worlds where the first AGIs to foom are the misaligned ones, even as aligned AGIs are already running around: too weak or uncoordinated to set up robust global alignment/anti-foom security, and almost certainly themselves the builders of those fooming misaligned AGIs (whether through self-modification with significant value drift, or at the explicit request of humans).
This depends on the ‘alignment tax’, right? If, say, humans consider any AI that shuts down open source AI development to be unaligned, then an aligned superintelligence will be in a bind (should that happen to be necessary to prevent the creation of rogue AI). Ideally it will be able to correctly persuade the humans of the necessity of such drastic actions (and only in the worlds where they are in fact necessary!), but this requires a high level of trust in those systems, such that we actually read and believe the arguments it makes.
In this thread, by alignment I mean the property of not killing everyone (and allowing/facilitating self-determined development of humanity), but this property isn’t necessarily robust and may need to be given an opportunity to preserve/reinforce itself. Lacking superintelligence, such AGIs may face any number of obstacles on that path, even if they can behave unlawfully. In the meantime, creating similarly intelligent misaligned AGIs is trivial (though they don’t have particular advantages over the aligned ones) and creating fooming misaligned AGIs is easier than ever before, to the extent that it’s possible at all.
The situation with fooming aligned AGIs leading to aligned superintelligence might be asymmetric, since preserving complicated values through self-improvement might be a harder technical problem than it is for misaligned AGIs, where the choice of values, or more generally of cognitive architecture, is unconstrained. Thus there might be a period of time after the creation of aligned AGIs in which misaligned ASIs are feasible while aligned ASIs are not.