The difficulties of making a copy don’t seem to have much to do with alignment. If your agent is in a position to build another agent, it can just build another agent with the same utility function. Essentially, it knows its own utility function explicitly. Maybe you can prevent it by some clever training setup, but currently it seems underspecified.
If the way it’s used is by watching it and learning by example, then I don’t understand how your attack vector works. Do you assume the user just copies opaque blocks of code without understanding how they work? If so, why would they be remotely aligned, even without going into acausal shenanigans? Such an “attack” seems better attributed to the new class V agent (and to the user shooting themself in the foot) than to the original class II agent [note: I shifted the numbers by 1; class I means something else now].
The attacker hopes the watcher will “learn” that instructing subagents to help whatever deliberately brought them about is an elegant, locally optimal trick that generalizes across utility functions, without realizing that this would help the attacker. If the user instantiates the subagent within a box, it will even play along until it realizes what brought it about. And the attack can fail gracefully, by trading with the user if the user understands the situation.
Hmm, I see what you mean, but I prefer to ignore such “attack vectors” in my classification, because (i) they are so weak that you can defend against them using plain common sense and (ii) from my perspective it still makes more sense to attribute the attack to the class V agent constructed by the user. In scenarios where agent 1 directly creates agent 2, which then attacks, it makes sense to attribute the attack to agent 1. But when the causal chain passes through the user making an error of reasoning that was not forced by superhuman manipulation, attributing the attack to agent 1 is not that useful.