here is a list of problems which i seek to either resolve or get around, in order to implement my formal alignment plans, especially QACI:
formal inner alignment: in the formal alignment paradigm, “inner alignment” refers to the problem of building an AI which, when run, actually maximizes the formal goal we give it (in tractable time) rather than doing something else, such as getting hijacked by an unaligned internal component of itself. because its goal is formal and fully general, it feels like building something that maximizes it should be much easier than the regular kind of inner alignment, and we could have a lot more confidence in the resulting system. (progress on this problem could be capability-exfohazardous, however!)
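as a minimal toy of what “actually maximizes the formal goal” means, here is a brute-force expected-utility maximizer over explicit world-hypotheses; all names and values are invented for illustration, and the whole difficulty is getting this same argmax property, tractably, out of a capable system:

```python
# toy sketch of the formal inner alignment desideratum: a system which,
# given a fully formal goal, really does return the goal-maximizing action.
# world-hypotheses, actions, and the utility function are all stand-ins.

def expected_utility(action, hypotheses, utility):
    # hypotheses: iterable of (prior_weight, transition) pairs, where
    # transition(action) returns an outcome and utility(outcome) a float
    return sum(w * utility(t(action)) for w, t in hypotheses)

def maximize(actions, hypotheses, utility):
    # the property we want a real system to provably have: the returned
    # action is an actual argmax of expected utility under the prior
    return max(actions, key=lambda a: expected_utility(a, hypotheses, utility))

# usage with stand-in values
hyps = [(0.7, lambda a: a * 2), (0.3, lambda a: -a)]
best = maximize(range(10), hyps, utility=float)
assert best == 9  # argmax of 0.7*2a - 0.3*a = 1.1a over 0..9
```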
continuous alignment: given a utility function which is theoretically eventually aligned, in the sense that there exists a capability level above which maximizing it has good outcomes, how do we bridge the gap from where we are now to that level? will a system “accidentally” destroy all value before becoming capable enough to realize it shouldn’t have?
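one hedged way to write the assumed property down, with Good, outcome, and the capability parameter c all left informal:

```latex
\exists c^{*} \,.\; \forall c \geq c^{*} :\quad
\mathrm{Good}\!\left( \mathrm{outcome}\!\left( \arg\max_{\pi} \, \mathbb{E}\!\left[ U \mid \pi, c \right] \right) \right)
```

continuous alignment is then about the trajectory through c < c*, not just the endpoint.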
blob location: for QACI, how do we robustly locate pieces of data, stored on computers in our world, inside bottom-level-physics turing-machine solomonoff hypotheses for that world? see 1, 2, 3 for details.
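to make the shape of the problem concrete, here is a heavily simplified sketch, with a fake one-line “UTM” standing in for real bottom-level-physics hypotheses; everything here is invented for illustration, not QACI’s actual construction:

```python
# toy sketch of blob location: jointly weight (world-program, extractor)
# pairs by simplicity, keeping those where the extractor, applied to the
# world-program's output, yields the blob exactly.

def run_world(program: bytes, length: int = 64) -> bytes:
    # stand-in "UTM": a world-program just tiles its code across the tape
    return (program * (length // len(program) + 1))[:length]

def locate_blob(blob: bytes, world_programs, tape_len: int = 64):
    posterior = {}
    for wp in world_programs:
        tape = run_world(wp, tape_len)
        for i in range(tape_len - len(blob) + 1):
            if tape[i:i + len(blob)] == blob:
                # solomonoff-style weight: shorter world-program and
                # shorter extractor description (here, just the offset)
                posterior[(wp, i)] = 2.0 ** -(8 * len(wp) + i.bit_length())
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}

# usage: the blob b"hi" is only found in worlds whose programs contain it
print(locate_blob(b"hi", [b"hi!", b"xhiy", b"zzz"]))
```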
physics embedding: related to the previous problem, how precisely does the prior we’re using need to capture our world, for the intended instance of the blobs to be locatable? can we just find the blobs in the universal program — or, if P≠BQP, some universal quantum program? do we need to demand worlds to contain, say, a dump of wikipedia to count as ours? can we use the location of such a dump as a prior for the location of the blobs?
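as a sketch of the anchoring idea at the end of the item above (the window size and the distance-based weighting are arbitrary illustrative choices, not a worked-out proposal):

```python
# first find a large known artifact (say, a wikipedia dump) in a
# hypothesis's output, then use its position as a prior over where the
# much smaller blobs should be.

def anchored_blob_search(tape: bytes, anchor: bytes, blob: bytes, window: int = 10**6):
    i = tape.find(anchor)
    if i == -1:
        return None  # this hypothesis doesn't obviously contain our world
    # search for the blob only near the anchor, weighting closer hits higher
    lo, hi = max(0, i - window), i + len(anchor) + window
    j = tape.find(blob, lo, hi)
    if j == -1:
        return None
    return {"blob_index": j, "weight": 1.0 / (1 + abs(j - i))}

# usage with stand-in data
tape = b"..." + b"WIKIPEDIA_DUMP" + b"..." + b"question-blob" + b"..."
print(anchored_blob_search(tape, b"WIKIPEDIA_DUMP", b"question-blob"))
```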
infrastructure design: what formal-math language will the formal goal be expressed in? what kind of properties should it have? should it include some kind of proving system, and in what logic? in QACI, will this also be the language for the user’s answer? what kind of checksums should accompany the question and answer blobs? these questions are at this stage premature, but they will need some figuring out at some point if formal alignment is, as i currently believe, the way to go.
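premature as they are, one of these sub-questions is easy to sketch concretely; here is one possible, purely illustrative, framing for checksummed blobs (sha-256 and the length-prefixed layout are assumptions, not a proposed design):

```python
# minimal sketch of checksums accompanying the question and answer blobs,
# so a candidate bitstring found inside a world-hypothesis can be
# integrity-checked with no outside context.
import hashlib, os

def make_blob(payload: bytes) -> bytes:
    # frame: length prefix + payload + digest
    return len(payload).to_bytes(8, "big") + payload + hashlib.sha256(payload).digest()

def check_blob(candidate: bytes) -> bool:
    n = int.from_bytes(candidate[:8], "big")
    payload, digest = candidate[8:8 + n], candidate[8 + n:8 + n + 32]
    return len(digest) == 32 and hashlib.sha256(payload).digest() == digest

question = make_blob(os.urandom(128))  # a random-looking question blob
assert check_blob(question) and not check_blob(question[:-1])
```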
(this answer is cross-posted on my blog)