D-imitations and DD-imitations robustly preserve the goodness of the people being imitated, despite the imperfection of the imitation;
My model of Paul thinks it’s sufficient to train the AIs to be corrigible act-based assistants that are competent enough to help us significantly, while also able to avoid catastrophes. If possible, this would allow significant wiggle room for imperfect imitation.
Paul and I disagreed about the ease of training such assistants, and we hashed out a specific thought experiment: if we humans were trying our hardest to be competent, catastrophe-free, corrigible act-based assistants to some aliens, is there some reasonable training procedure they could give us that would enable us to significantly and non-catastrophically help the aliens perform a pivotal act? Paul thought yes (IIRC), while I felt iffy about it. After all, we might need to understand tons and tons of alien minutiae to avoid any catastrophes, and given how different our cultures (and brains) are from theirs, it seems unlikely we’d be able to capture all the relevant minutiae.
I’ve since warmed up to the feasibility of this. It seems like there aren’t too many ways to cause existential catastrophes, it’s pretty easy to determine what things constitute existential catastrophes, and it’s pretty easy to spot them in advance (at least as well as the aliens would). Yes, we might still make some catastrophic mistakes, but they’re likely to be benign, and it’s not clear that the risk of catastrophe we’d incur is much worse than the risk the aliens would incur if a large team of them tried to execute a pivotal act. Perhaps there’s still room for things like accidental mass manipulation, but this feels much less worrisome than existential catastrophe (and also seems plausibly preventable with a sufficiently competent operator).
I suspect another major crux on this point is whether there is a broad basin of corrigibility (link). If so, it shouldn’t be too hard for D-imitations to be corrigible, nor for IDA to preserve corrigibility for DD-imitations. If not, it seems likely that corrigibility would be lost through distillation. I think this is also a crux for Vaniver in his post about his confusions with Paul’s agenda.