AGI-assisted alignment in Dath Ilan (excerpt from here)
Suppose Dath Ilan got into a situation where they had to choose the strategy of AGI-assisted alignment, and didn’t have more than a few years to prepare. Dath Ilan wouldn’t actually get themselves into such a situation, but if they did, how might they go about it?
I suspect that among other things they would:
Make damn well sure to box the AGIs before they plausibly could become dangerous/powerful.
Try, insofar as they could, to make their methodologies robust to hardware exploits (Rowhammer, etc.): not only by making such exploits hard to pull off, but also by thinking carefully about which code they ran on which computers, and so on.
Limit communication bandwidth for AGIs (they might think of having humans directly exposed to AGI communication as “touching the lava”, and try to obtain help with alignment-work while touching lava as little as possible).
Insofar as they saw a need to expose humans to AGI communication, those humans would be sealed off from society (and the AGIs' communication would be heavily restricted). In Dath Ilan, the idea of having the operators themselves be exposed to AGI-generated content from superhuman AGIs they don't trust to be aligned would have been seen as outside the realm of consideration.
Humans would be exposed to AGI communication mostly so as to test the accuracy of systems that predict human evaluations (if they don't test humans on AGI-generated content, they're not testing the full range of outputs); a rough sketch of what such testing might look like follows this list. But even this they would try to minimize or get around entirely (by, among other things, using techniques such as the ones I summarize here).
In Dath Ilan, it would be seen as a deranged idea to have humans evaluate arguments (or to predict how humans would evaluate arguments) and then simply trust arguments that seem good to humans. To them, that would be kind of like trying to make a watertight basket but never filling it with water to see whether water leaked through. Instead, they would use techniques such as the ones I summarize here.
When forced to confront chicken-and-egg problems, they would see them as a challenge (after all, chickens exist and eggs exist, so it's not as if chicken-and-egg problems never have solutions).
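To make the point about testing evaluation-predictors concrete, here is a minimal, purely illustrative Python sketch (every function and data item is a hypothetical stand-in, not anything from the post): the question it operationalizes is whether a predictor of human evaluations stays accurate on AGI-generated content specifically, rather than only on the human-written content it may have been tuned on.

```python
import random
from statistics import mean

def predictor_score(text: str) -> float:
    """Stand-in for a system that predicts how a human would rate `text` (0-1)."""
    return random.random()  # hypothetical placeholder, not a real model

def human_score(text: str) -> float:
    """Stand-in for an actual (carefully firewalled) human rating of `text` (0-1)."""
    return random.random()  # hypothetical placeholder, not real data

def mean_abs_error(texts: list[str]) -> float:
    """Average gap between predicted and actual human ratings over a set of texts."""
    return mean(abs(predictor_score(t) - human_score(t)) for t in texts)

# Two illustrative evaluation sets: the baseline distribution the predictor was
# presumably tuned on, and the distribution that actually matters.
human_written = ["argument A", "argument B"]
agi_generated = ["argument C", "argument D"]

# If the predictor is only accurate on human-written text, comparing these two
# numbers is where that failure would show up.
print("error on human-written:", mean_abs_error(human_written))
print("error on AGI-generated:", mean_abs_error(agi_generated))
```

The design point the sketch gestures at is simply that the held-out test set must include AGI-generated outputs; agreement measured only on human-written text says little about how the predictor behaves on the outputs it will actually be judging.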