This is obviously vulnerable to adversarial examples or extreme OOD settings, but robustness seems to increase with the compute used, and we can do a decent job of OOD-catching.
The issue, as I understand it, is that under a sufficiently powerful optimizer "everything" essentially becomes adversarial, including the OOD-catching itself.
I understand this in principle, but it seems to imply that for less scary AGIs this might actually work. That unlocks a pretty massive part of the solution space (e.g. helping with alignment). Obviously we don't know exactly how much of it, but that seems reasonably testable (e.g. OOD detection is also a precondition for self-driving cars, so people know how to make it well-calibrated).
It’s not a “solution”, but it’s substantially harder to imagine a catastrophic failure from a large AGI project that isn’t actually bidding for superintelligence.
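To make the OOD-detection point above concrete, here is a minimal sketch of the kind of detector I have in mind: the standard max-softmax-probability baseline, thresholded so that most in-distribution inputs are accepted. This is purely illustrative, not anyone's actual proposal; the synthetic logits stand in for a real model's outputs, and the 95% acceptance threshold is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Stand-in for a trained classifier: in-distribution inputs produce
# confident (peaked) logits, OOD inputs produce flatter logits.
in_dist_logits = rng.normal(0, 1, size=(1000, 10)) + 5 * np.eye(10)[rng.integers(0, 10, 1000)]
ood_logits = rng.normal(0, 1, size=(1000, 10))

def max_softmax_score(logits):
    """Higher score = model is more confident the input is in-distribution."""
    return softmax(logits).max(axis=1)

# Pick a threshold on held-out in-distribution data (here the 5th percentile,
# a hypothetical choice) so roughly 95% of in-distribution inputs are accepted.
threshold = np.percentile(max_softmax_score(in_dist_logits), 5)

def flag_ood(logits, threshold):
    """Flag inputs whose confidence falls below the calibrated threshold."""
    return max_softmax_score(logits) < threshold

print("In-dist flagged as OOD:", flag_ood(in_dist_logits, threshold).mean())
print("OOD flagged as OOD:    ", flag_ood(ood_logits, threshold).mean())
```

The calibration step is the part that matters for the argument: you pick the threshold empirically on held-out in-distribution data, so the false-alarm rate is a design choice rather than a hope, which is exactly the kind of thing that seems testable on less scary systems.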