Desiderata 1 and 2 seem to be the general, non-negotiable goals of AI control (I call property 1 scalability, but many people have talked about it under different names).
Why wouldn't a human convert the future into paperclips, if what they wanted was to maximize paperclips? This sounds structurally like “A human would not, in general, attempt to convert their entire future light cone into flourishing conscious experience.” But once we change the noun, I'm not so sure the claim still holds.
Desiderata 3 and 4 also seem quite general; many (most?) people working on AI control aim to establish properties of this form.
Presumably the way to effect transparency under self-modification is to (1) ensure that transparent techniques are competitive with their opaque counterparts, and then (2) build systems that help humans get what they want in general, and so help the humans build understandable+capable AI systems as a special case.