It cannot be the case that successful value alignment requires perfect adversarial robustness.
It seems like the argument structure here is something like:
This requirement is too stringent for humans to follow
Humans have successful value alignment
Therefore this requirement cannot be necessary for successful value alignment.
I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summoning a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.
So any reasoning that’s like “well so long as it’s not unusual we can be sure it’s safe” runs into the thing where we’re living in the acute risk period. The usual is not safe!
Similarly, an AI that knows it’s vulnerable to adversarial attacks, and wants to avoid being attacked successfully, will take steps to protect itself against such attacks. I think creating AIs with such meta-preferences is far easier than creating AIs that are perfectly immune to all possible adversarial attacks.
This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.
[FWIW I also don’t think we want an AI that’s perfectly robust to all possible adversarial attacks; I think we want one that’s adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I’m mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]
It seems like the argument structure here is something like:
This requirement is too stringent for humans to follow
Humans have successful value alignment
Therefore this requirement cannot be necessary for successful value alignment.
I disagree with point 2, tho; among other things, it looks to me like some humans are on track to accidentally summoning a demon that kills both me and them, which I expect they would regret after-the-fact if they had the chance to.
So any reasoning that’s like “well so long as it’s not unusual we can be sure it’s safe” runs into the thing where we’re living in the acute risk period. The usual is not safe!
This seems definitely right to me. An expectation I have is that this will also generate resistance to alignment techniques / control by its operators, which perhaps complicates how benign this is.
[FWIW I also don’t think we want an AI that’s perfectly robust to all possible adversarial attacks; I think we want one that’s adequate to defend against the security challenges it faces, many of which I expect to be internal. Part of this is because I’m mostly interested in AI planning systems able to help with transformative changes to the world instead of foundational models used by many customers for small amounts of cognition, which are totally different business cases and have different security problems.]