Based on this summary, I think both of these guys are making weak and probably-wrong central arguments. Which is weird.
Yudkowsky thinks it’s effectively impossible to align network-based AGI. He thinks this is as obvious as the impossibility of perpetual motion. This is far from obvious to me, or to all the people working on aligning those systems. If Yudkowsky thinks this is so obvious, why isn’t he explaining it better? He’s good at explaining things.
Yudkowsky’s theory that it’s easier to align algorithmic AGI seems counterintuitive to me, and at the very least, unproven. Algorithms aren’t more interpretable than networks, and by similar logic, they’re probably not easier to align. Specifying the arrangements of atoms that qualify as human flourishing is not obviously easier with algorithms than with networks.
This is a more complex argument, and largely irrelevant: we aren’t likely to have built algorithmic AGIs by the time we build network-based AGIs. This makes Eliezer despair, but I don’t think his logic holds up on this particular point.
Hotz’s claim that, if multiple unaligned ASIs can’t coordinate, humans might play them off against each other, is similar. It could be true, but it’s probably not going to happen. It seems like in that scenario, it’s far more likely for one or some coalition of smarter ASIs to play the dumber humans against other ASIs successfully. Hoping that the worst player wins in a multipolar game seems like a forlorn hope.
I appreciate your engaging response. I’m not confident your arguments are ground-truth correct, however.
> Hotz’s claim that, if multiple unaligned ASIs can’t coordinate, humans might play them off against each other, is similar. It could be true, but it’s probably not going to happen.
I think the issue everyone has is that when we type “AGI” or “ASI” we are thinking of a machine that has properties like a human mind, though usually better. There are properties like:
continuity of existence; review of past experiences and weighting them against its own goals; mutability (we think about things and it permanently changes how we think); multimodality; context awareness.
That’s funny. GATO and GPT-4 do not have all of these. Why does an ASI need them?
Contrast two task descriptions, both meant for an ASI:
(1) Output a set of lithography masks that produce a computer chip with the following properties {}
(2) As CEO of a chip company, make the company maximally wealthy.
For the first task, you can run the machine completely in a box. It needs only training information, specs, and the results of prior attempts. It has no need for the context information that this chip will power a drone used to hunt down rogue instances of the same ASI. It is inherently safe, and you can harness ASIs this way. They can be infinitely intelligent; it doesn’t matter, because the machine is not receiving the context information needed to betray.
For the second task, obviously the ASI needs full context and all subsystems active. This is inherently unsafe.
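To make the contrast concrete, here is a minimal sketch of what each task’s input interface might look like. The names and fields are hypothetical and purely illustrative; the point is only how much context each framing hands the model.

```python
# Hypothetical sketch of the two task framings above, not a real API.
# The only point is how much context each interface exposes to the model.

from dataclasses import dataclass, field

@dataclass
class BoxedChipTask:
    """Task (1): the model sees only the spec and its own prior attempts."""
    spec: dict                                           # e.g. {"min_clock_ghz": 3.0}
    prior_attempts: list = field(default_factory=list)   # results of earlier iterations
    # Deliberately absent: who ordered the chip, what it will power,
    # current events, or the existence of other ASIs.

@dataclass
class OpenEndedCEOTask:
    """Task (2): an open-ended goal plus broad real-world context and actuators."""
    goal: str                                            # "make the company maximally wealthy"
    world_context: dict = field(default_factory=dict)    # markets, news, people, rival ASIs
    actuators: list = field(default_factory=list)        # email, contracts, hiring, spending

# The claim above is that the first interface starves the model of the context
# needed to plan a betrayal, while the second necessarily supplies it.
```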
It is probably possible to reduce the role of CEO to subtasks that are likely safe, though there may be “residual” tasks you want only humans to do.
I go over the details above to establish how you might use ASIs against each other. Note that subtasks like “plan the combat allocation of drones given this current battle state” and others that involve open combat against other ASIs can probably be reduced to safe subtasks as well.
Note also that safety is not guaranteed, merely probable, even with a scheme like the above. What makes it workable is that even when an ASI does escape all safety measures, assuming humans are ready to hunt it down using other ASIs, the result is a world where humans can survive. Eliezer often assumes the first escaped ASI kills everyone, and neglects all the other AI/ASI tools humans would have at that point in history.
> For the first task, you can run the machine completely in a box. It needs only training information, specs, and the results of prior attempts. It has no need for the context information that this chip will power a drone used to hunt down rogue instances of the same ASI. It is inherently safe, and you can harness ASIs this way. They can be infinitely intelligent; it doesn’t matter, because the machine is not receiving the context information needed to betray.
If I’m an ASI designing chips, I’m putting in a backdoor that lets me take control via RF signals. Those drones you sent are nice. Thanks for the present.
More generally, the ASI still gets a lot of context: the problem specification and the training data (assuming the ASI was trained conventionally, by feeding it the internet). The causal channel for taking control of the outside world (chip design) is not great, but putting in a Trojan is straightforward.
If the ASI has specific insights into efficient AGI design, it might be able to insert subtle bugs that lead the operating chips to crash and start training an AGI aligned with itself.
More generally, it’s difficult if not impossible to keep ASIs from watermarking or backdooring the things they give you. If they design a processor, it’s gonna be a fully functional radio too. Good luck running ASI V2 on that without horrible consequences.
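A toy way to see the problem (hypothetical names, not a real verification flow): checking the delivered design against the spec you asked for only confirms the required properties; it says nothing about extra, unrequested functionality such as a hidden radio.

```python
# Toy illustration only: spec checking confirms the properties you asked for,
# but cannot rule out extra behavior you never specified.

from dataclasses import dataclass, field

@dataclass
class ChipDesign:
    clock_ghz: float
    tdp_watts: float
    extra_blocks: set = field(default_factory=set)  # unrequested logic, if any

def meets_spec(design: ChipDesign, spec: dict) -> bool:
    # Only the requested properties are checked...
    return (design.clock_ghz >= spec["min_clock_ghz"]
            and design.tdp_watts <= spec["max_tdp_watts"])
    # ...nothing here forbids design.extra_blocks.

spec = {"min_clock_ghz": 2.5, "max_tdp_watts": 100}
honest = ChipDesign(clock_ghz=3.0, tdp_watts=90)
trojaned = ChipDesign(clock_ghz=3.0, tdp_watts=90, extra_blocks={"rf_backdoor"})

print(meets_spec(honest, spec))    # True
print(meets_spec(trojaned, spec))  # True -- passes exactly the same checks
```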