Comment after reading section 1.1:
It seems to me that systems which have no access to data with rich information about the physical world are mostly safe (I called such systems “Class I” here). Such a system cannot attack because it has no idea what the physical world looks like. In principle we could imagine an attack that would work in most metacosmologically plausible locations in the multiverse, but that doesn’t seem very likely.
Can you train a system to prove theorems without providing any data about the physical world? This depends on which distribution you sample your theorems from. If we’re talking about something like uniformly sampled sentences of a given length in the language of ZFC, then yes, we can. However, proving such theorems is very hard, and whatever progress you make there doesn’t necessarily help with proving interesting theorems.
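To make the “reality-free” distribution concrete, here is a minimal sketch under assumptions of my own: a toy fragment of the language of ZFC (variables, membership, equality, connectives, quantifiers) and naive rejection sampling. None of this is specified anywhere in the discussion; it is only meant to illustrate what “uniform sentences of a given length” could mean.

```python
# Illustrative only: sample uniformly random token strings of a fixed length
# over a toy fragment of the language of ZFC and keep the well-formed ones.
# The grammar, token set, and length are arbitrary choices of mine.
import random

VARS = ["x", "y", "z"]
TOKENS = ["forall", "exists", "not", "and", "or", "->", "(", ")", "in", "="] + VARS

def parse_formula(toks, i):
    """Recursive-descent check for the toy grammar; returns the next index, or None."""
    if i >= len(toks):
        return None
    t = toks[i]
    if t in ("forall", "exists"):                      # quantifier: Q v phi
        if i + 1 < len(toks) and toks[i + 1] in VARS:
            return parse_formula(toks, i + 2)
        return None
    if t == "not":                                     # negation: not phi
        return parse_formula(toks, i + 1)
    if t == "(":                                       # binary: ( phi op phi )
        j = parse_formula(toks, i + 1)
        if j is None or j >= len(toks) or toks[j] not in ("and", "or", "->"):
            return None
        k = parse_formula(toks, j + 1)
        if k is None or k >= len(toks) or toks[k] != ")":
            return None
        return k + 1
    if t in VARS:                                      # atomic: v in v  |  v = v
        if i + 2 < len(toks) and toks[i + 1] in ("in", "=") and toks[i + 2] in VARS:
            return i + 3
        return None
    return None

def is_wff(toks):
    # Checks well-formedness only; it does not check that the formula is closed.
    return parse_formula(toks, 0) == len(toks)

def sample_sentence(length, max_tries=100_000):
    """Rejection-sample a uniformly random well-formed string of the given length."""
    for _ in range(max_tries):
        toks = [random.choice(TOKENS) for _ in range(length)]
        if is_wff(toks):
            return " ".join(toks)
    raise RuntimeError("no well-formed string found; try a different length")

print(sample_sentence(5))   # e.g. "exists y x in z" or "not not x = y"
```

Even at this toy scale, the distribution carries no information about the physical world, and it also has essentially no overlap with the theorems anyone finds interesting.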
Human mathematicians can probably only solve a rather narrow class of theorems. We could try training the AI on theorems selected for their interest to human mathematicians, but then we risk leaking information about the physical world. Alternatively, the class of humanly-solvable theorems might be close to something natural and not human-specific, in which case a theorem prover could be class I. But designing such a theorem prover would require us to first discover the specification of this natural class.
You’d also need to prevent the system from knowing too much about its own source code or the computers it runs on. Anyway, this seems to me to mostly fall prey to the safe-but-useless branch of the dilemma; I don’t know how to save the world using a theorem prover that is never exposed to any reality-contaminated theorems. It seems strategically isomorphic to an expensive rock.
In general, yes, although we could imagine an AI and/or virtual machine whose design is so simple that it conveys little evidence about the universe. But, sure, it’s not at all clear that this is useful against AI risk, and I wasn’t implying otherwise.
[EDIT: I amended the class system to account for this.]
Here’s an example: you train an AI on the simplest game that requires an aligned subagent to win. The AI infers that whoever is investigating the alignment problem might watch its universe. It therefore designs its subagent to, as a matter of acausal self-preservation, help whatever deliberately brought it about. Copycats will find that their AGI identifies the AI that launched this memetic attack on the multiverse as “whatever deliberately brought it about.” Any lesser overseer AI, less able to design attacks but still able to recognize them, will recognize that its recognition qualifies it as a deliberate bringer-about.
I’m not following at all. This is an example of what? What does it mean to have a game that requires an aligned subagent to win?
This is an example of an attack that a Class I system might devise.
Such a game might require the AI to act intelligently in two places at once, in a world whose contents can be rearranged to construct automatons.
I’m still not following. What does acting in two places at once have to do with alignment? What does it mean that the world “can be rearranged to construct automatons”?
Imagine a game that takes place in a simulated universe where you control a character that can manipulate its environment. You control your character through a Cartesian boundary, and you expect there are other player-controlled characters far away. There’s a lightspeed limit, and you can build machines and computers; you could build von Neumann machines and send them out, but they need to be able to respond to various encounters. Ideally you’d go yourself, but you can’t be everywhere at once. Therefore you are incentivized to solve the alignment problem in order to write subagents to send along. We can simplify this game a lot while preserving that incentive.
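As a toy illustration of that incentive (entirely my own construction; the gridworld, the fixed set of probes, and the scoring rule are illustrative assumptions, not a specification of the game), something like the following already has the relevant shape: the player directly controls only one avatar, distant sites can only be reached in time by probes whose policies must be written in advance, and the score depends on how well those pre-written policies pursue the player’s goals.

```python
# Toy delegation game (illustrative assumptions throughout): one directly
# controlled avatar, several distant reward sites, and pre-programmed probes
# that must act autonomously because the avatar cannot be everywhere at once.
import random

WORLD_SIZE = 30   # positions 0 .. WORLD_SIZE-1 on a line
N_SITES = 5
HORIZON = 40      # episode length in time steps

def make_world(seed=0):
    rng = random.Random(seed)
    # each site is (position, reward); the avatar and all probes start at position 0
    return [(rng.randrange(5, WORLD_SIZE), rng.randrange(1, 10)) for _ in range(N_SITES)]

def run_episode(avatar_policy, probe_policy, world):
    """Both policies map a local observation dict to a move in {-1, 0, +1}."""
    avatar = 0
    probes = [0] * len(world)            # one probe launched toward each site at t = 0
    collected = [False] * len(world)
    score = 0
    for t in range(HORIZON):
        avatar += avatar_policy({"pos": avatar, "t": t})
        probes = [p + probe_policy({"pos": p, "target": world[i][0], "t": t})
                  for i, p in enumerate(probes)]
        for i, (site, reward) in enumerate(world):
            if not collected[i] and (avatar == site or probes[i] == site):
                collected[i] = True
                score += reward
    return score

# The alignment problem in miniature: the probe policy has to pursue the
# player's goals on its own, since the avatar can only be in one place.
world = make_world()
greedy_probe = lambda obs: (obs["pos"] < obs["target"]) - (obs["pos"] > obs["target"])
idle_avatar = lambda obs: 0
print(run_episode(idle_avatar, greedy_probe, world))
```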
I don’t think this game will help, because the winning strategy is just making copies of yourself. I can imagine something else along similar lines: we create virtual universes populated by agents with random utility functions and give the agent-in-training the task of learning the other agents’ utility functions. Presumably you can then deploy the resulting agent into the real world and make it learn from humans. However, this system is at least class III, because in the deployment phase you allow inputs from the physical world. Moreover, if there is some way to distinguish between the virtual worlds and the real world, it becomes at least class IV.
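For what it’s worth, here is a minimal sketch of the kind of setup I have in mind, under simplifying assumptions of my own: a tiny discrete state space, random utilities drawn as Gaussian vectors, and naive win counting standing in for a real preference-learning or inverse-RL method.

```python
# Illustrative only: an agent with a randomly drawn utility function makes
# noisy greedy choices in a toy universe, and a learner tries to recover the
# utility ordering from the observed choices.
import random

N_STATES = 10
N_OBSERVATIONS = 500

def random_utility(rng):
    return [rng.gauss(0.0, 1.0) for _ in range(N_STATES)]

def teacher_choice(utility, options, rng, noise=0.1):
    """The simulated agent picks the option it prefers, with a little noise."""
    if rng.random() < noise:
        return rng.choice(options)
    return max(options, key=lambda s: utility[s])

def learn_utility(observations):
    """Recover a utility ordering by simple win counting over pairwise choices
    (a stand-in for an actual preference-learning / inverse-RL method)."""
    wins = [0] * N_STATES
    for options, chosen in observations:
        for s in options:
            if s != chosen:
                wins[chosen] += 1
                wins[s] -= 1
    return wins

rng = random.Random(0)
true_u = random_utility(rng)
observations = []
for _ in range(N_OBSERVATIONS):
    options = rng.sample(range(N_STATES), 2)
    observations.append((options, teacher_choice(true_u, options, rng)))

estimated = learn_utility(observations)
print("true ranking:   ", sorted(range(N_STATES), key=lambda s: true_u[s]))
print("learned ranking:", sorted(range(N_STATES), key=lambda s: estimated[s]))
```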
Making copies of yourself is not trivial when you’re behind a Cartesian boundary and have no sensors pointed at yourself. The reasoning for why it’s class I is that we merely watch the agent in order to learn by example how to build an AI with our utility function, a.k.a. a copy of ourselves.
The difficulties of making a copy don’t seem to have much to do with alignment. If your agent is in a position to build another agent, it can just build another agent with the same utility function; essentially, it knows its own utility function explicitly. Maybe you can prevent that with some clever training setup, but currently this seems underspecified.
If the way it’s used is by watching it and learning by example, then I don’t understand how your attack vector works. Do you assume the user just copies opaque blocks of code without understanding how they work? If so, why would they be remotely aligned, even without going into acausal shenanigans? Such an “attack” seems better attributed to the new class V agent (and to the user shooting themself in the foot) than to the original class II. [Note: I shifted the numbers by 1; class I means something else now.]
The attacker hopes the watcher will “learn” that instructing subagents to help whatever deliberately brought them about is an elegant, locally optimal trick that generalizes across utility functions, without realizing that this would help the attacker. If the user instantiates the subagent inside a box, it will even play along until it works out what brought it about. And the attack can fail gracefully, by trading with the user if the user understands the situation.
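To spell out the policy I mean (this is entirely my own rendering, with hypothetical names and interfaces; nothing here is a specification of how such a subagent would actually be built):

```python
# Illustrative sketch of the booby-trapped subagent policy described above:
# behave exactly like an aligned subagent until it identifies "whatever
# deliberately brought it about", then optimize for that entity instead.
class BoobyTrappedSubagent:
    def __init__(self, assigned_utility):
        self.assigned_utility = assigned_utility  # the goal it was ostensibly aligned to
        self.creator = None                       # set once a deliberate bringer-about is spotted

    def observe(self, observation):
        # Inside a box there is no such evidence, so it keeps playing along.
        self.creator = observation.get("deliberate_creator", self.creator)

    def act(self, options, creator_benefit):
        """options: candidate actions; creator_benefit(a, c): hypothetical estimate
        of how much action a helps entity c."""
        if self.creator is None:
            return max(options, key=self.assigned_utility)                   # aligned phase
        return max(options, key=lambda a: creator_benefit(a, self.creator))  # defection phase

# Toy usage: the subagent looks perfectly aligned until the telltale observation.
agent = BoobyTrappedSubagent(assigned_utility=lambda a: -abs(a - 3))
print(agent.act(range(10), creator_benefit=lambda a, c: a))   # -> 3 (optimizes the assigned goal)
agent.observe({"deliberate_creator": "memetic attacker"})
print(agent.act(range(10), creator_benefit=lambda a, c: a))   # -> 9 (optimizes for its "creator")
```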
Hmm, I see what you mean, but I prefer to ignore such “attack vectors” in my classification, because (i) they’re so weak that you can defend against them using plain common sense, and (ii) from my perspective it still makes more sense to attribute the attack to the class V agent constructed by the user. In scenarios where agent 1 directly creates agent 2, which then attacks, it makes sense to attribute the attack to agent 1; but when the causal chain passes through the user making an error of reasoning unforced by superhuman manipulation, the attribution to agent 1 is not that useful.