I understand that the AGI may only infer the utility function from what happened before its existence. That seems to mean it has to interpret evidence about past actions, because there is no way to objectively access the past: for example, a photo taken of something an agent did, or the present arrangement of their house resulting from purchases they made. This suggests two avenues of attack: erasure of information and fake evidence. How does this approach deal with these?
I think your concern is a special case of this paragraph:
How might this protocol solve Inner Alignment? The only way to change our AGI’s actions is by changing its world model, because of its strict architecture that completely pins down a utility function to maximize (and the actions that maximize it) given a world model. So, allegedly, the only possible mesa-optimizers will take the form of acausal attackers (that is, simulation hypotheses), or at least something that can be very naturally modelled as an acausal attack (any false hypothesis about the world that changes the precursor that is chosen as the user, or a property relevant to actions maximizing their utility). And also allegedly, the methods implemented against radical acausal attacks will be sufficient to avoid this (and other less radical wrong hypotheses will be naturally dealt with by our AGI converging on the right physical world model).
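To unpack the “strict architecture” claim, here is a rough schematic in my own notation (just an illustration of the dependency structure, not the actual formalism of the proposal, which builds on infra-Bayesian physicalism):

```latex
% Schematic only; W, P and U are my own notation.
%   W        : the world model (hypothesis about physics, including the past)
%   P(W)     : the precursor/user identified inside W
%   U_{P(W)} : that precursor's utility function, as inferred from W
\[
  \pi^{*}(W) \;=\; \operatorname*{arg\,max}_{\pi}\;
    \mathbb{E}_{W}\!\left[\, U_{P(W)} \;\middle|\; \pi \,\right]
\]
% Since W \mapsto \pi^*(W) is fixed by the architecture, the only way to
% corrupt the agent's behaviour is to corrupt W itself, e.g. via a
% simulation hypothesis (acausal attack) or fake physical evidence.
```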
We need to prevent our agent from developing false hypotheses because of adversarial inputs (through its sensors). You mention the particular case in which the false hypotheses are about the past (a part of physical reality), and the adversarial input is provided as certain arrangements of present physical reality (which our AGI perceives through its sensors). These can be understood as very basic causal attacks. I guess all these cases are supposed to be dealt with by our AGI being capable enough (at modeling physical reality and updating its beliefs) to end up noticing the real past events. That is, given the messiness/inter-connectedness of physical reality (carrying out such procedures as “erasure of information” or “fake evidence” actually leaves many physical traces that an intelligent enough agent could identify), these issues would probably fall on the side of “less radical wrong hypotheses”, and they are supposed to “be naturally dealt with by our AGI converging on the right physical world model”.
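As a toy illustration of this “many traces” point (my own sketch, nothing from the PreDCA write-up): if a past event leaves many partially redundant, noisy traces, a Bayesian observer stays confident about what happened until almost all of them are destroyed, after which its posterior collapses back to the prior.

```python
# Toy illustration (not part of PreDCA): a past binary event E leaves many
# noisy, partially redundant "traces" in the present. An observer doing
# Bayesian inference over E from whatever traces survive stays confident
# as long as enough traces remain; erasing traces widens the posterior.
import random

random.seed(0)

TRUE_EVENT = 1            # what actually happened in the past
N_TRACES = 50             # photos, receipts, physical side effects, ...
TRACE_ACCURACY = 0.7      # each trace matches the true event with this prob.

def sample_traces(n):
    """Each trace independently reflects the true event with prob. TRACE_ACCURACY."""
    return [TRUE_EVENT if random.random() < TRACE_ACCURACY else 1 - TRUE_EVENT
            for _ in range(n)]

def posterior_event_happened(traces):
    """P(E = 1 | surviving traces) under a uniform prior and the noise model above."""
    like_1, like_0 = 1.0, 1.0
    for t in traces:
        like_1 *= TRACE_ACCURACY if t == 1 else 1 - TRACE_ACCURACY
        like_0 *= TRACE_ACCURACY if t == 0 else 1 - TRACE_ACCURACY
    return like_1 / (like_1 + like_0)

traces = sample_traces(N_TRACES)
for erased in (0, 25, 40, 48, 50):
    surviving = traces[erased:]          # an adversary destroys `erased` traces
    p = posterior_event_happened(surviving)
    print(f"traces erased: {erased:2d}  surviving: {len(surviving):2d}  P(E=1): {p:.3f}")
```

In this toy model the observer's confidence is a simple, computable function of how many traces survive and how noisy each one is.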
I agree that the interconnectedness of physical reality will leave traces. The question is: are those traces enough? Can we put bounds on that?
I imagine blowing up a lot of stuff at once will destroy more information than can be recovered from elsewhere.
I am somewhat certain PreDCA requires a specific human, but there should be enough information recorded about anyone with a large enough digital footprint to reconstruct a plausible simulacrum of them.
Keep in mind the ultimate goal is to get a good understanding of their preferences, not to actually recreate their entire existence with perfect fidelity.
PreDCA requires a human “user” to “be in the room” so that it is correctly identified as the “user”, but then only infers their utility from the actions they took before the AGI existed. This is achieved by inspecting the world model (which includes the past) on which the AGI converges. That is, the AGI is not “looking for traces of this person in the past”. It is reconstructing the whole past (and afterwards seeing what that person did there). Allegedly, if capabilities are high enough (to be dangerous), it will be able to reconstruct the past pretty accurately.
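To make “inferring their utility from what they did in the reconstructed past” more concrete, here is a minimal toy sketch in the style of Boltzmann-rational inverse planning (all names, candidate utilities and parameters are invented for illustration; this is not PreDCA's actual machinery):

```python
# Toy sketch (not PreDCA itself): infer which candidate utility function best
# explains a user's observed past actions, assuming the user is approximately
# (Boltzmann-)rational. Candidate utilities, actions and beta are all made up.
import math

# Hypothetical candidate utility functions over possible daily actions.
CANDIDATE_UTILITIES = {
    "values_health":  {"gym": 2.0, "salad": 1.5, "donut": -1.0, "tv": 0.0},
    "values_comfort": {"gym": -1.0, "salad": 0.0, "donut": 1.5, "tv": 2.0},
}
BETA = 1.0  # rationality parameter: higher = more reliably utility-maximizing

def action_likelihood(action, utility, beta=BETA):
    """P(action | utility) under a Boltzmann-rational choice model."""
    z = sum(math.exp(beta * u) for u in utility.values())
    return math.exp(beta * utility[action]) / z

def posterior_over_utilities(observed_actions):
    """Uniform prior over candidates, updated on the observed past actions."""
    log_scores = {}
    for name, utility in CANDIDATE_UTILITIES.items():
        log_scores[name] = sum(math.log(action_likelihood(a, utility))
                               for a in observed_actions)
    m = max(log_scores.values())
    weights = {name: math.exp(s - m) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Actions reconstructed from the world model's account of the pre-AGI past.
past_actions = ["gym", "salad", "gym", "tv", "salad"]
print(posterior_over_utilities(past_actions))
# -> assigns most of the probability mass to "values_health"
```

The point of the toy is just that what gets inferred is a compressed summary of preferences extracted from past behaviour, echoing the earlier remark that recreating the person's entire existence with perfect fidelity is not the goal.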
I guess the default answer would be that this is a problem for (the physical possibility of certain) capabilities, and we are usually only concerned with our Alignment proposal working in the limit of high capabilities. Not (only) because we might think these capabilities will be achieved, but because any less capable system will a priori be less dangerous: it is far more likely that its capabilities fail in some uninteresting way (unrelated to Alignment), or that the failure affects many other aspects of its performance (rendering it unable to achieve dangerous instrumental goals), than that capabilities fail in just the right way for most of its potential achievements to remain untouched while the goal is relevantly altered. In your example, if our model truly can’t converge with moderate accuracy to the right world model, we’d expect it not to have a clear understanding of the world around it, and so, for instance, to be easily turned off.
That said, it might be interesting to consider more seriously whether the literal physical impossibility of efficiently reconstructing the past could make PreDCA slightly more dangerous for super-capable systems.
Thanks for the long answer. I agree that my question is likely more tangential.