For a superintelligent AI, alignment might as well be binary, just as for practical purposes you either have a critical mass of U235 or you don’t, notwithstanding the narrow transition region. But can you expand on the terms “weakly aligned” and “weakly superintelligent”? Even after searching alignmentforum.org and lesswrong.org for these, their intended meanings are not clear to me. One post says:
weak alignment means: do all of the things any competent AI researcher would obviously do when designing a safe AI.
For instance, you should ask the AI how it would respond in various hypothetical situations, and make sure it gives the “ethically correct” answer as judged by human beings.
My shoulder Eliezer is rolling his eyes at this.
ETA: And here I find:
To summarize, weak alignment, which is what this post is mostly about, would say that “everything will be all right in the end.” Strong alignment, which refers to the transient, would say that “everything will be all right in the end, and the journey there will be all right, too.”
I find it implausible that it is easier to build a machine that might destroy the world but is guaranteed to eventually rebuild it, than to build one that never destroys the world. It is easier to not make an omelette than it is to unmake one.
Agreed that for a post-intelligence-explosion AI, alignment is effectively binary. I do agree with the sharp left turn etc. positions, and don’t expect patches and cobbled-together solutions to hold up all the way to the stratosphere.
Weakly aligned: guided towards the kinds of things we want, in ways which don’t have strong guarantees. A central example is InstructGPT, but this also includes most interpretability work (unless it becomes dramatically more effective than the current generation), and what I understand to be Paul’s main approaches.
Weakly superintelligent: superintelligent in some domains, but has not yet undergone recursive self-improvement.
These are probably non-standard terms; I’m very happy to be pointed to existing literature with different ones which I can adopt.
I am confident Eliezer would roll his eyes; I have read a great deal of his work and recent debates. I respectfully disagree with his claim that you can’t get useful cognitive work on alignment out of systems which have not yet FOOMed and taken a sharp left turn, based on my understanding of intelligence as babble and prune. I don’t expect us to get enough cognitive work out of these systems in time, but it seems like a path which has non-zero hope.
It is plausible that AIs unavoidably FOOM before the point at which they can contribute, but this seems less and less likely as capabilities advance and we notice we’re not dead.
I don’t come close to agreeing with either of those. FOOM basically requires physics violations, such as violating Landauer’s Principle and needing arbitrarily small processors. I’m being frank because I suspect that much of the doom position requires hard takeoff, and, based on physics and the history of what happens as AI improves, only the first improvement is a discontinuity; the rest are far smoother and slower. So that’s a big crux I have here.
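For readers who want a number attached to Landauer’s Principle: it puts a floor of k_B · T · ln 2 on the energy dissipated per irreversibly erased bit. Below is a minimal sketch of that arithmetic, not something from the comment itself. The function names, the 300 K temperature and the 1 W power budget are all illustrative assumptions, and the bound only constrains irreversible bit erasure, so on its own it does not settle the takeoff question either way.

```python
import math

# Boltzmann constant in J/K (exact value in the 2019 SI definition).
BOLTZMANN_J_PER_K = 1.380649e-23


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy (J) dissipated when one bit is irreversibly erased."""
    if temperature_kelvin <= 0:
        raise ValueError("temperature must be positive (kelvin)")
    return BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)


def max_bit_erasures_per_second(power_watts: float, temperature_kelvin: float) -> float:
    """Upper bound on irreversible bit erasures per second for a power budget."""
    return power_watts / landauer_limit_joules(temperature_kelvin)


if __name__ == "__main__":
    room_temperature = 300.0  # kelvin; assumed purely for illustration
    per_bit = landauer_limit_joules(room_temperature)
    print(f"Landauer limit at {room_temperature:.0f} K: {per_bit:.3e} J per bit")
    # Hypothetical 1 W budget, again only for illustration.
    print(f"Max erasures/s at 1 W: {max_bit_erasures_per_second(1.0, room_temperature):.3e}")
```

At 300 K this comes out at roughly 2.9e-21 J per erased bit.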