Neither quite captures it, imo. I think it’s mostly an attempt at deconfusion:
We can’t hope to solve alignment just by sufficiently nudging the relevant AI’s utility function, since knowing something about that utility function (as argued here) requires either predicting it (not merely tweaking it and crossing your fingers really hard) or predicting the AI’s behavior. This is a substantially harder problem than the term “alignment” suggests on the surface, and it seems to be one we cannot avoid. Interpretability (as far as I’m aware) is nowhere near delivering this. The prediction section makes an epistemic argument, though: it suggests that solving the alignment problem is much harder than one might think, not how likely doom by default is. For that, we would also need to know the sampling space we’ll be drawing from when developing AGI and, as shminux points out, a definition that delineates the safe part of that space.
The second thesis is more strictly about the impossibility of safety guarantees, but I think it’s most interesting in conjunction with the first one: if we need prediction for alignment, and prediction is impossible for computationally universal systems, then the alignment problem may also be impossible to solve for such systems. Hence the fire thesis (which is a possibility we should take seriously, not something that’s shown here; there might be all kinds of informal clues for safety, a bit more on that in the discussion section here).
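To make the second premise a bit more concrete, here is a minimal sketch (my own illustration, not something from the post) of the standard diagonalization behind the claim that no program can perfectly predict the behavior of arbitrary programs in a computationally universal setting. The names `predict` and `contrarian` are hypothetical placeholders, and the stub predictor is only there to make the contradiction visible:

```python
# Sketch of the diagonalization argument: assume a claimed perfect predictor
# `predict(source, arg)` that returns whatever the program in `source` would
# return on `arg`. The `contrarian` program is built to ask the predictor
# about itself and then do the opposite, so no such predictor can be both
# total and always correct.

CONTRARIAN_SOURCE = '''
def contrarian(arg):
    # Ask the (claimed) predictor what this very program will return on `arg`,
    # then return the opposite.
    predicted = predict(CONTRARIAN_SOURCE, arg)
    return "unsafe" if predicted == "safe" else "safe"
'''

def predict(source, arg):
    # Stand-in for the hypothetical perfect predictor. Any concrete answer it
    # gives here ("safe" or "unsafe") is immediately falsified by `contrarian`.
    return "safe"

# Materialize `contrarian` from its source so it can refer to itself by name.
namespace = {"predict": predict, "CONTRARIAN_SOURCE": CONTRARIAN_SOURCE}
exec(CONTRARIAN_SOURCE, namespace)
contrarian = namespace["contrarian"]

print(predict(CONTRARIAN_SOURCE, 0))  # predictor claims: "safe"
print(contrarian(0))                  # program actually returns: "unsafe"
```

Whatever the predictor answers about `contrarian`, `contrarian` does the opposite; this is the same self-reference obstacle that blocks general behavioral prediction, and, via the first point, formal safety guarantees, for computationally universal systems.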