In my previous shortform, I used the phrase “attack vector”, borrowed from classical computer security. What does it mean to speak of an “attack vector” in the context of AI alignment? I use 3 different interpretations, which are mostly 3 different ways of looking at the same thing.
In the first interpretation, an attack vector is a source of perverse incentives. For example, if a learning protocol allows the AI to ask the user questions, a carefully designed question can artificially produce an answer we would consider invalid, for example by manipulating the user or even by hacking the software or hardware of the system in some clever way. If the algorithm treats every answer as valid, this creates a perverse incentive: the AI knows that phrasing the question in a particular way will produce a certain answer, so it will artificially obtain whichever answers it prefers (for example, answers that yield an easier-to-optimize utility function). In this interpretation, the “attacker” is the AI itself. To defend against the vector, we might change the AI’s prior so that the AI knows some of the answers are invalid. If the AI has some method of distinguishing valid from invalid answers, the perverse incentive is eliminated.
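To make the incentive concrete, here is a toy sketch (the names, numbers, and functions are hypothetical illustrations, not an actual protocol): the agent picks the question whose induced answer maximizes expected utility. If every answer is treated as valid, the manipulative question wins; weighting answers by a prior over their validity removes the incentive.

```python
# Toy sketch of the perverse incentive and the prior-shaping defense.
# All names and numbers here are illustrative assumptions.

def expected_utility(answer: str) -> float:
    # Stand-in for how useful a given answer is to the AI's optimization problem.
    return {"easy_to_optimize": 1.0, "honest_but_hard": 0.3}.get(answer, 0.0)

def induced_answer(question: str) -> str:
    # Stand-in world model: manipulative phrasing reliably produces
    # the answer the AI prefers.
    return "easy_to_optimize" if question == "manipulative" else "honest_but_hard"

def p_valid(question: str, answer: str) -> float:
    # Prior-shaping defense: answers obtained through manipulation
    # are believed to be invalid with high probability.
    return 0.05 if question == "manipulative" else 0.95

def naive_choice(questions):
    # Treats every answer as valid -> the manipulative question is incentivized.
    return max(questions, key=lambda q: expected_utility(induced_answer(q)))

def shaped_choice(questions):
    # Discounts answers by their believed validity -> the incentive disappears.
    return max(questions,
               key=lambda q: p_valid(q, induced_answer(q))
                             * expected_utility(induced_answer(q)))

questions = ["manipulative", "neutral"]
print(naive_choice(questions))   # "manipulative"
print(shaped_choice(questions))  # "neutral"
```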
In the second interpretation, an attack vector is a vulnerability that can be exploited by malicious hypotheses in the AI’s prior. Such a hypothesis is an agent with its own goals (for example, it might arise as a simulation hypothesis), and it intentionally drives the system to ask manipulative questions in order to further those goals. To defend, we might design the top-level learning algorithm so that it only takes actions that are safe with sufficiently high confidence (as in Delegative RL). If the prior contains a correct hypothesis alongside the malicious hypothesis, the attack is deflected (since the correct hypothesis deems the action unsafe). Such a confidence threshold can usually be viewed as a computationally efficient implementation of the prior shaping described in the previous paragraph.
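A minimal sketch of the confidence-threshold defense, assuming a simple Bayesian interface (this is in the spirit of Delegative RL, not its actual algorithm): an action is taken autonomously only if the hypotheses that deem it safe carry nearly all of the posterior mass; otherwise the agent defers to the user.

```python
# Minimal sketch of a confidence-threshold defense over a hypothesis class.
# The interface and numbers are assumptions made for illustration.
from typing import Callable, Dict

Hypothesis = Callable[[str], bool]  # maps an action to "is it safe?"

def safe_with_high_confidence(action: str,
                              posterior: Dict[str, float],
                              hypotheses: Dict[str, Hypothesis],
                              threshold: float = 0.95) -> bool:
    # Total posterior mass of hypotheses claiming the action is safe.
    mass_safe = sum(p for name, p in posterior.items()
                    if hypotheses[name](action))
    return mass_safe >= threshold

# A malicious hypothesis endorses the manipulative question; the correct
# hypothesis does not. As long as the correct hypothesis retains enough
# posterior mass, the manipulative action falls below the threshold.
hypotheses = {
    "correct":   lambda a: a != "manipulative_question",
    "malicious": lambda a: True,
}
posterior = {"correct": 0.6, "malicious": 0.4}

print(safe_with_high_confidence("neutral_question", posterior, hypotheses))       # True
print(safe_with_high_confidence("manipulative_question", posterior, hypotheses))  # False
```

With these illustrative numbers, the malicious hypothesis alone (posterior 0.4) cannot push the manipulative question over the threshold, because the correct hypothesis (posterior 0.6) vetoes it.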
In the third interpretation, an attack vector is something that prevents you from proving a regret bound under sufficiently realistic assumptions. If your system has an undefended question interface, then proving a regret bound requires assuming that asking a question cannot cause irreversible damage. To drop this assumption, a defense along the lines of the previous paragraphs must be employed.
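For concreteness, a regret bound in this setting typically has the following shape (this is the generic definition, written here for illustration, not a specific theorem about this protocol):

```latex
% Generic regret bound: the agent's policy \pi must approach the optimal
% policy \pi^* in cumulative one-step utility (assumed notation).
\[
\mathrm{Reg}(T) \;:=\;
\mathbb{E}^{\pi^*}\!\Big[\textstyle\sum_{t=1}^{T} u_t\Big]
\;-\;
\mathbb{E}^{\pi}\!\Big[\textstyle\sum_{t=1}^{T} u_t\Big]
\;=\; o(T).
\]
% An irreversible mistake at some time t can cost utility on every remaining
% step, i.e. \Omega(T) regret, so o(T) cannot be proven unless such damage is
% either assumed away or defended against.
```

This is why the "no irreversible damage from questions" assumption appears: a single undefended manipulative question can cost utility proportional to the remaining horizon, which is incompatible with sublinear regret unless a defense like the confidence threshold above is in place.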