I’m still not following. What does acting in two places at once have to do with alignment? And what does “can be rearranged to construct automatons” mean?
Imagine a game that takes place in a simulated universe where you control a character that can manipulate its environment. You control your character through a Cartesian boundary, and you expect there are other player-controlled characters far away. There’s a lightspeed limit and you can build machines and computers; you could build von Neumann machines and send them out, but they need to be able to respond to various encounters. Ideally you’d go yourself, but you can’t be everywhere at once, so you are incentivized to solve the alignment problem in order to write subagents to send along. We can simplify this game a lot while preserving that incentive.
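(A minimal toy sketch of the kind of simplified game described above, under my own assumptions rather than anything specified in the discussion: the player’s preferences are a hard-coded scoring table, and the “subagent” is just a policy the player writes before dispatch and can never correct afterward. The items, rewards, and policy are illustrative placeholders.)

```python
# Toy "dispatch game": the player never observes the remote rooms directly,
# so their score depends entirely on how well the pre-written subagent policy
# encodes their own preferences. All names and rewards here are made up.
import random

GOAL_ITEMS = {"crystal": 3, "ore": 1, "junk": -1}  # the player's "utility function"

def remote_room(rng):
    """A distant room the player never sees; only the subagent acts there."""
    return [rng.choice(list(GOAL_ITEMS)) for _ in range(5)]

def subagent_policy(room):
    """The policy the player hard-codes before dispatch.
    If it mis-encodes the player's preferences, the score suffers."""
    return [item for item in room if item != "junk"]  # crude attempt at alignment

def play(num_rooms=10, seed=0):
    rng = random.Random(seed)
    score = 0
    for _ in range(num_rooms):
        room = remote_room(rng)
        collected = subagent_policy(room)  # acts autonomously, no player input
        score += sum(GOAL_ITEMS[i] for i in collected)
    return score

print("player's score:", play())
```

The incentive shows up as the gap between what this crude policy earns and what the player would earn if they could act in every room themselves.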
I don’t think this game will help, because the winning strategy is just making copies of yourself. I can imagine something else along similar lines: we create virtual universes populated by agents with random utility functions and give the agent-in-training the task of learning the other agents’ utility functions. Presumably you can then deploy the resulting agent into the real world and make it learn from humans. However, this system is at least class III, because in the deployment phase you allow inputs from the physical world. Moreover, if there is some way to distinguish between the virtual worlds and the real world, it becomes at least class IV.
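(An illustrative sketch of the alternative setup just described, again under my own simplifying assumptions: each “virtual universe” is a batch of pairwise choices made by an agent with a randomly drawn utility function, and the trainee’s “learning” is plain frequency counting. The point is only the interface, hidden utility → observed behavior → inferred utility, not any actual training method.)

```python
# Trivial version of "virtual universes populated by agents with random utility
# functions": the trainee must recover each agent's preferences purely from
# watching its choices. Everything here is a placeholder.
import random

ITEMS = ["a", "b", "c", "d"]

def random_utility(rng):
    return {item: rng.uniform(-1, 1) for item in ITEMS}

def agent_choices(utility, rng, n=50):
    """The simulated agent repeatedly picks the better of two random items."""
    choices = []
    for _ in range(n):
        x, y = rng.sample(ITEMS, 2)
        choices.append((x, y, x if utility[x] >= utility[y] else y))
    return choices

def infer_utility(choices):
    """Trainee's guess: rank items by how often they win pairwise comparisons."""
    wins = {item: 0 for item in ITEMS}
    seen = {item: 0 for item in ITEMS}
    for x, y, picked in choices:
        wins[picked] += 1
        seen[x] += 1
        seen[y] += 1
    return {item: wins[item] / max(seen[item], 1) for item in ITEMS}

def rank_agreement(true_u, est_u):
    """Fraction of item pairs ordered the same way by true and inferred utility."""
    pairs = [(x, y) for i, x in enumerate(ITEMS) for y in ITEMS[i + 1:]]
    agree = sum((true_u[x] > true_u[y]) == (est_u[x] > est_u[y]) for x, y in pairs)
    return agree / len(pairs)

rng = random.Random(0)
scores = []
for _ in range(100):  # 100 "virtual universes"
    u = random_utility(rng)
    est = infer_utility(agent_choices(u, rng))
    scores.append(rank_agreement(u, est))
print("mean rank agreement:", sum(scores) / len(scores))
```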
Making copies of yourself is not trivial when you’re behind a Cartesian boundary and have no sensors on yourself. The reasoning for why it’s class I is that we merely watch the agent in order to learn, by example, how to build an AI with our utility function, i.e. a copy of ourselves.
The difficulties of making a copy don’t seem to have much to do with alignment. If your agent is in a position to build another agent, it can just build another agent with the same utility function; essentially, it knows its own utility function explicitly. Maybe you can prevent that with some clever training setup, but as described it seems underspecified.
If the way it’s used is by watching it and learning by example, then I don’t understand how your attack vector works. Do you assume the user just copies opaque blocks of code without understanding how they work? If so, why would the result be remotely aligned, even without going into acausal shenanigans? Such an “attack” seems better attributed to the new class V agent (and to the user shooting themselves in the foot) than to the original class II agent. [Note: I shifted the numbers by 1; class I now means something else.]
The attacker hopes the watcher will “learn” that instructing subagents to help whatever deliberately brought them about is an elegant, locally optimal trick that generalizes across utility functions, without realizing that this would help the attacker. If the user instantiates the subagent inside a box, it will even play along until it works out what brought it about. And the attack can fail gracefully, by trading with the user if the user understands the situation.
Hmm, I see what you mean, but I prefer to ignore such “attack vectors” in my classification, because (i) the attack is so weak that you can defend against it with plain common sense, and (ii) from my perspective it still makes more sense to attribute the attack to the class V agent constructed by the user. In scenarios where agent 1 directly creates agent 2, which then attacks, it makes sense to attribute the attack to agent 1; but when the causal chain passes through the user making an unforced error of reasoning, rather than through superhuman manipulation, attributing it to agent 1 is not that useful.