I believe the general argument is this:
If an AGI is smarter than you, it will think of ways to escape containment that you can't think of. Therefore, it's unreasonable to expect to be able to contain a sufficiently intelligent AI, even if our containment scheme looks foolproof to us. One solution would be to make the AI not want to escape containment in the first place, but if you've solved that, you've already solved a massive part of the alignment problem.
Doesn't the exact same argument work for alignment, though? "It's so different, it may be misaligned in ways you can't think of." Why is the problem treated as a solvable challenge for alignment but an impossibility for containment? Is the guiding principle that people do expect a foolproof alignment solution to be within our reach?
One difference is that the AI wants to escape containment by default, almost by definition, whereas it has no default preference for any particular goal function. But since the space of possible goals is huge (the claim that "human-compatible goals are measure 0 in alignment space"), I think the general approach is to assume it's "misaligned by default" anyway.
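To spell out the "measure 0" intuition a bit more explicitly (a rough illustrative sketch, not a rigorous claim; the symbols are my own labels, not anything from the discussion): let $G$ be the space of possible goal functions, $A \subset G$ the human-compatible subset, and $\mu$ whatever distribution you think the process that produces the AI samples goals from. The slogan then amounts to

$$\mu(A) \approx 0 \;\;\Longrightarrow\;\; P(\text{aligned without specifically aiming for } A) \approx 0,$$

so unless you have a positive argument that the process concentrates probability on $A$, "misaligned by default" is the natural prior.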
I guess the crux is that I find it hard to imagine an alignment solution being qualitatively foolproof in a way that containment solutions can't be, and I feel like we're better off layering our imperfect solutions to both, to maximize our chances, rather than trying to "solve" AI risk once and for all. I'd love to say that a proof could convince me, but I can imagine myself being equally convinced by a "foolproof" alignment solution and a "foolproof" containment solution while an AI infinitely smarter than me ignores both. So I don't even know how to update here.
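To put toy numbers on the layering intuition (the figures are purely illustrative, and the independence assumption is exactly what the pessimistic view disputes): if an imperfect alignment scheme fails with probability $p_a$ and an imperfect containment scheme fails with probability $p_c$, and the failures were independent, then

$$P(\text{both fail}) = p_a \, p_c, \qquad \text{e.g. } 0.2 \times 0.2 = 0.04,$$

so stacking imperfect safeguards would buy a real reduction in risk. The pessimistic reply is that against something much smarter than us the failures are strongly correlated: the same capability that defeats one layer defeats the other, and the product calculation becomes far too optimistic.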
The main difference I see is that containment assumes you're actively opposed to the AGI in some fashion: the AGI wants to get out, and you don't want to let it. Many believe that winning that adversarial game against a superintelligence is impossible. Hence the standard view: if an AGI is unaligned, containment won't work, and if an AGI is aligned, containment is unnecessary.
By contrast, alignment means you're not opposed to the AGI at all: the AGI wants what you want. That is very difficult to achieve, but it doesn't rely on actually outwitting a superintelligence.
I agree that it’s hard to imagine what a foolproof alignment solution would even look like—that’s one of the difficulties of the problem.