If you’re trying to align wildly superintelligent systems, you don’t have to worry about any concern related to your system being incompetent.
In general, this seems false. What you don't have to worry about is subhuman competence. You may still have to worry about incompetence relative to some highly superhuman competence threshold. (It may be fine to say this isn't an alignment problem, but it's still a worry.)
One concern is reaching [competent at X] before [competent at operating safely when [competent at X]].
Here it'd be fine if the system had perfect knowledge of the risks of X, or perfect calibration of its uncertainty about such risks. Replace "perfect" with "wildly superhuman" and you lose the guarantee. If human-level competence would be wildly unsafe at the [...operating safely...] task, then knowing the system will do better than that isn't worth much. (We're wildly superchimp at AI safety; that may not be good enough.)
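To put this ordering concern in symbols (a toy formalization in my own notation, not the original's): write $C_X(t)$ for the system's competence at X at time $t$, $C_{\text{safe}}(t)$ for its competence at operating safely given that capability, and $\theta_X$, $\theta_{\text{safe}}$ for the relevant capability and safety thresholds. The worry is a window in which

$$C_X(t) \ge \theta_X \quad \text{while} \quad C_{\text{safe}}(t) < \theta_{\text{safe}},$$

and this can hold even when both competences are comfortably superhuman, since $\theta_{\text{safe}}$ may itself be wildly superhuman.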
I think it can sometimes be misleading to think/talk about issues "in the limit of competence": in the limit you're throwing away information about relative competence levels (at least unless you're careful to also take limits of all the important ratios and the like).
E.g. take two systems:
Alice: [x power, x² wisdom]
Bob: [x power, √x wisdom]
We can let x tend to infinity and say they’re both arbitrarily powerful and arbitrarily wise, but I’d still trust Alice a whole lot more than Bob at any given time (for safe exploration, and many other things).
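To make the ratio point concrete, here's a minimal worked limit using the toy power/wisdom functions above: both wisdom terms diverge, but the wisdom-to-power ratios go opposite ways.

$$\lim_{x\to\infty} x^2 = \lim_{x\to\infty} \sqrt{x} = \infty, \qquad \text{yet} \qquad \lim_{x\to\infty} \frac{x^2}{x} = \infty \quad \text{while} \quad \lim_{x\to\infty} \frac{\sqrt{x}}{x} = 0.$$

So "arbitrarily wise in the limit" describes both Alice and Bob, yet wisdom relative to power grows without bound for Alice and vanishes for Bob.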
I don't think it's enough to say "Bob will self-modify to become Alice-like (in a singleton scenario)". The concerning cases are where Bob has insufficient wisdom to notice or look for a desirable [self-modify to Alice]-style option.
It’s conceivable to me that this is a non-problem in practice: that any system with only modestly super-human wisdom starts to make Wei Dai look like a reckless megalomaniac, regardless of its power. Even if that’s true, it seems important to think about ways to train systems such that they acquire this level of wisdom early.
Perhaps this isn’t exactly an alignment problem—but it’s the kind of thing I’d want anyone aligning wildly superintelligent systems to worry about (unless I’m missing something).