I think the basic assumed argument here (though I’m not sure where or even if I’ve seen it explicitly laid out) goes essentially like this:
1. Using neural nets is more like the immune system’s “generate everything and filter out what doesn’t work” than it is like normal coding or construction. And there are limits on how much you can tamper with this, because the whole point of neural nets is that humans don’t know how to write code that performs as well as neural nets do; if we knew how to write such code deliberately, we wouldn’t need neural nets in the first place.
2. Part of that filter is, hopefully, designed to screen out misalignment. Presumably we agree that if you don’t have this, you are going to have a bad time.
3. This means that two things will get through your filter: golden-BB false negatives (misaligned AIs that, by sheer one-in-a-million luck, sit in exactly the configurations that fool all your checks), and the truly aligned AIs you actually want.
4. But both corrigibility and perfect sovereign alignment are highly rare (corrigibility because it runs against instrumental convergence, and perfect sovereign alignment because value is fragile), which means your misalignment filter is competing against that rarity to determine what comes out.
5. If P(golden-BB false negative) ≪ P(alignment), all is well.
6. But if P(golden-BB false negative) ≫ P(alignment) despite your best efforts, then you just get golden-BB false negatives. Sure, they’re highly weird, but they’re still less weird than what you’re looking for, and so you wind up producing them reliably once you push hard enough to get something that passes your filter.
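To make the selection effect in steps 5–6 concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is a hypothetical stand-in of my own choosing (the argument above doesn’t commit to any particular rates); the point is only that conditioning on “passed the filter” pits the two base rates against each other, regardless of how weird either outcome is in absolute terms.

```python
# Hypothetical illustration of the filter-selection argument.
# All rates below are made up for the example, not claims about real systems.

p_aligned = 1e-12      # assumed prior: alignment is "highly rare"
p_golden_bb = 1e-9     # assumed prior: a misaligned model that happens to
                       # land in a configuration fooling every check

# The filter passes aligned models (nearly) always, and golden-BB false
# negatives always -- by definition, they are the ones that fool all checks.
p_pass_given_aligned = 0.99
p_pass_given_golden_bb = 1.0

# Posterior odds of "golden-BB false negative" vs "aligned",
# given that something passed the filter (Bayes' rule, odds form).
odds = (p_golden_bb * p_pass_given_golden_bb) / (p_aligned * p_pass_given_aligned)
print(f"deceptive pass vs genuinely aligned pass: {odds:,.0f} : 1")
# With these made-up numbers: ~1,010 : 1. The filter's output is dominated
# by whichever class is less rare before filtering, exactly as the argument says.
```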