Thought-provoking post, thanks. It crystallizes two related reasons we might expect capability ceilings or diminishing returns to intelligence:
1. Unpredictability from chaos—as you say, “any simulation of reality with that much detail will be too costly”.
2. Difficulty of distributing goal-driven computation across physical space.
where human-AI alignment is a special case of (2).
How much does your core argument rely on solutions to the alignment problem being all or nothing? A more realistic model in my view is that there’s a spectrum of “alignment tech”, where better tech lets you safely delegate to more and more powerful systems.
Thank you! Really nice articulation of the capability ceilings.
Regarding the question, I certainly haven’t included that nuance in my brief exposition, and it should be accounted for as you mention. This will probably have non-continuous (or at least non-smooth) consequences for the risks graph.
TL;DR: Most of our risk comes from failing to align our first AGI (a discontinuity). Immediately past that point, a further increase in difficulty penalizes almost only that AGI, so its capabilities, and with them our risk, decrease (the AI might be able to solve some forms of alignment and not others). I think this alters the risk distribution but not the red quantity. If anything, it points at comparing the risk at impossible difficulty with the risk at exactly the difficulty that still allows us to solve alignment (rather than at an unspecified “very difficult” level), which could already be deduced from the setting, even if I didn’t explicitly mention it.
Detailed explanation:
Say humans can correctly align agents of complexity up to C_H with their objectives (or, more accurately, below this complexity the alignment will, with high probability, be completely safe or almost harmless). Define C_S analogously for the superintelligence we are considering (or even for any superintelligence whatsoever, if some very hard alignment problems are physically unsolvable and this quantity is therefore finite).
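Just to fix notation for what follows (my own shorthand, purely for concreteness): write d for the difficulty of solving alignment, c_0 for the complexity of the first AGI we build, and treat both thresholds as functions of d,

$$
C_H = C_H(d), \qquad C_S = C_S(d), \qquad \text{both (approximately) continuous and decreasing in } d.
$$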
For humans, whether C_H lies slightly above or slightly below the complexity of the first AGI we ever build will have vast consequences for our existential risk (and we know, for socio-political reasons, that this AGI will likely be built, etc.). But for the superintelligence, we should expect its capabilities and power to be continuous in C_S (there are no socio-political reasons by which failing to solve one particular alignment problem would end its capabilities).
The (direct) dependence I mention in the post between its capabilities and human existential risk can be expected to be continuous as well (even if perhaps close to constant, because “a less capable unaligned superintelligence is already bad enough”). Since both C_H and C_S are expected to depend continuously (or at least approximately continuously at macroscopic scales) and inversely on the difficulty of solving alignment, we would get a very steep increase in risk at the point where humans fail to solve their alignment problem (where risk depends inversely and non-continuously on C_H), and no comparably steep decrease as the AGI’s capabilities fall (where risk depends directly and continuously on C_S).
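As a purely illustrative sketch (the functional forms and numbers below are made up, not derived from anything in the post), here is a toy model of risk as a function of the difficulty d: a discontinuous jump where C_H(d) drops below c_0, and only a mild, continuous decline afterwards through C_S(d).

```python
# Toy illustration only: the functional forms and constants are invented for this sketch.

C_FIRST_AGI = 5.0  # assumed complexity c_0 of the first AGI we build


def C_H(d):
    """Max complexity humans can safely align; continuous and decreasing in difficulty d."""
    return 10.0 / (1.0 + d)


def C_S(d):
    """Max complexity the superintelligence can align; also continuous and decreasing in d."""
    return 100.0 / (1.0 + d)


def risk(d):
    """Toy existential risk as a function of alignment difficulty d.

    While humans can still align the first AGI (C_H(d) >= C_FIRST_AGI), risk stays low.
    Past that point it jumps discontinuously, then varies only mildly and continuously
    with the unaligned system's capability, proxied here by C_S(d).
    """
    if C_H(d) >= C_FIRST_AGI:
        return 0.05  # aligned: residual risk only
    capability = C_S(d)  # continuous in d
    return 0.70 + 0.25 * capability / (1.0 + capability)  # steep jump, then gentle slope


for d in [0.0, 0.5, 1.0, 1.5, 2.0, 4.0, 8.0, 16.0]:
    print(f"d = {d:4.1f}   C_H = {C_H(d):5.2f}   risk = {risk(d):.3f}")
```

The printed curve shows a single steep upward jump around the difficulty where humans can no longer align the first AGI, followed by a shallow, continuous decline toward the “less capable but still unaligned” baseline, which is exactly the asymmetry claimed above.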