Planned summary for the Alignment Newsletter:

One reason that researchers might disagree about which approaches to take for alignment is that they might be solving different versions of the alignment problem. This post identifies two axes along which the “type” of alignment problem can differ. First, you may be considering AI systems with differing levels of capability, ranging from subhuman to wildly superintelligent, with human-level somewhere in the middle. Second, you might be thinking about different mechanisms by which the AI leads to bad outcomes, where possible mechanisms include <@the second species problem@>(@AGI safety from first principles@) (where AIs seize control of the future from us), the “missed opportunity” problem (where we fail to use AIs as well as we could have, even though the AIs aren’t themselves threatening us), and a grab bag of other possibilities (such as misuse of AI systems by bad actors).
Depending on where you land on these axes, you will get to rely on different assumptions that change what solutions you would be willing to consider:
1. **Competence.** If you assume that the AI system is human-level or superintelligent, you probably don’t have to worry about the AI system causing massive problems through incompetence (at least, not to a greater extent than humans do).
2. **Ability to understand itself.** With wildly superintelligent systems, it seems reasonable to expect that they can introspect and answer questions about their own cognition, which could be a useful ingredient in a solution that wouldn’t work in other regimes.
3. **Inscrutable plans or concepts.** With sufficiently competent systems, you might be worried about the AI system making dangerous plans you can’t understand, or reasoning with concepts you will never comprehend. Your alignment solution must be robust to this.
Planned opinion:
When I talk about alignment, I am considering the second species problem, with AI systems whose capability level is roughly human-level or above (including “wildly superintelligent”).