> We design user detection so that anything below a threshold is a “user”
Yes, but simulators might not just “alter reality so that they are slightly more causally tight than the user”, they might even “alter reality so that they are inside the threshold and the user no longer is”, right? I guess that’s why you mention some filtering is still needed.
> I believe that deep learning has theoretical guarantees, we just don’t know what they are
I understand now. I guess my point would then be restated as: given the amount of room that simulators (intuitively seem to) have to trick the AGI (even with all the above developments), it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks. Not because of formal guarantees being impossible to prove about that training procedure (e.g. DL), but because pruning attacks from the space of hypotheses is too complicated of a search for any human-made algorithm/procedure to carry out (because of the variety of attacks and the vastness of the space of hypotheses).
> We can’t just “rearrange the program in a way that outputs information that wasn’t already there”, because if it isn’t already there, the bridge transform will not assert this rearranged program is running.

Of course! I understand now, thank you.
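For concreteness, here is a minimal toy sketch of the threshold worry raised above, assuming (purely for illustration) that agent detection reduces to a scalar “causal tightness” score per candidate program and a fixed cutoff; every name and number below is hypothetical and not part of PreDCA itself:

```python
# Toy model (not PreDCA's actual criterion): each hypothesis assigns every
# candidate program a "causal tightness" score, and anything below the
# cutoff counts as a candidate user.

CUTOFF = 1.0  # hypothetical threshold; lower score = more causally tight

def candidate_users(tightness_scores: dict[str, float]) -> set[str]:
    """Return the programs whose score falls inside the threshold."""
    return {name for name, score in tightness_scores.items() if score < CUTOFF}

# Null hypothesis: the human user is the only program inside the threshold.
null_hypothesis = {"human_user": 0.4, "thermostat": 3.0}

# Attack hypothesis: a simulator plants a fake agent inside the threshold and
# claims the human is less tightly coupled to the AI than it really is.
attack_hypothesis = {"human_user": 1.2, "planted_agent": 0.3, "thermostat": 3.0}

print(candidate_users(null_hypothesis))    # {'human_user'}
print(candidate_users(attack_hypothesis))  # {'planted_agent'}
```

The reply below is, roughly, that the attack hypothesis does not actually get to choose the human’s score: to avoid being falsified it must compute the same AI-user feedback loop as the null hypothesis.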
> Yes, but simulators might not just “alter reality so that they are slightly more causally tight than the user”, they might even “alter reality so that they are inside the threshold and the user no longer is”, right?
No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it’s falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it’s not possible for the attacker to make the user-AI loop less tight.
> ...it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks… because of the variety of attacks and the vastness of the space of hypotheses.
The variety of attacks doesn’t imply the impossibility of defending from them. In cryptography, we have protocols immune from all attacks[1] despite a vast space of possible attacks. Similarly, here I’m hoping to gradually transform the informal arguments above into a rigorous theorem (or well-supported conjecture) that the system is immune.

[1] As long as the assumptions of the model hold, ofc. And, assuming some (highly likely) complexity-theoretic conjectures.
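A small sketch of the falsification point above, under the simplifying and entirely hypothetical assumption that hypotheses are just programs predicting the AI’s observation stream, with the user’s replies included in that stream:

```python
# Toy falsification filter (illustration only): a hypothesis survives only if
# it reproduces every observation the null hypothesis ("reality") predicts.
# The user's responses to the AI are among those observations, so any
# surviving simulation hypothesis must also be computing the user responding
# to the AI -- it cannot drop or loosen that loop.

def null_world(t: int) -> str:
    # Reality: at each step the user replies to the AI's previous action.
    return f"user_reply_to_action_{t - 1}" if t > 0 else "silence"

def lazy_simulation(t: int) -> str:
    # An attack hypothesis that skips simulating the user in detail.
    return "silence"

def surviving(hypotheses: dict, horizon: int = 5) -> list[str]:
    """Discard hypotheses that disagree with the null world's observations."""
    return [
        name for name, h in hypotheses.items()
        if all(h(t) == null_world(t) for t in range(horizon))
    ]

print(surviving({"null": null_world, "lazy_sim": lazy_simulation}))  # ['null']
```

Only simulations that reproduce the user-AI loop step for step survive, which is the sense in which the attacker cannot make that loop less tight than it is under the null hypothesis.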
> No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it’s falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it’s not possible for the attacker to make the user-AI loop less tight.
Yes, I had understood that, but this is only the case in the limit when the AI is completely certain about every minute detail of its immediate physical reality, right? Otherwise, as in my above example, the simulator could introduce microscopic variations (wherever the AI isn’t yet completely certain about reality, for instance in some parts of the user’s brain) which subtly alter reality in such a way that the information from counterfactual actions takes longer to travel between the AI and the user. Or am I missing something?
> The variety of attacks doesn’t imply the impossibility of defending from them.

You’re right, thank you!
If the information takes a little longer to arrive, then the user will still be inside the threshold.
A more concerning problem is: what if the simulation only contains a coarse grained simulation of the user, such that it doesn’t register as an agent? To account for this, we might need to define a notion of “coarse grained agent” and allow such entities to be candidate users. Or, maybe any coarse grained agent has to be an actual agent with a similar loss function, in which case everything works out on its own. These are nuances that probably require uncovering more of the math to understand properly.
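To make the coarse graining concern concrete, here is a toy sketch with an entirely hypothetical “agency score” standing in for PreDCA’s actual agency measure: a detailed model of the user clears the agent cutoff, while a coarse grained stand-in that only approximately reproduces the user’s behaviour may not, and so never even becomes a candidate user.

```python
# Toy illustration (not PreDCA's actual agency measure): score a candidate by
# how often its policy picks the loss-minimising action, and count it as an
# agent -- hence as a possible user -- only if that score clears a cutoff.

import random

AGENT_CUTOFF = 0.9  # hypothetical

def best_action(state: int) -> int:
    return state % 3  # stand-in for the action the user's loss function favours

def detailed_user(state: int) -> int:
    return best_action(state)  # fine-grained model reproduces the user's choices

def coarse_user(state: int) -> int:
    # Coarse-grained simulation: matches the user's choice only ~60% of the time.
    return best_action(state) if random.random() < 0.6 else (state + 1) % 3

def agency_score(policy, trials: int = 1000) -> float:
    states = [random.randrange(100) for _ in range(trials)]
    return sum(policy(s) == best_action(s) for s in states) / trials

for name, policy in [("detailed_user", detailed_user), ("coarse_user", coarse_user)]:
    score = agency_score(policy)
    print(name, round(score, 2), "agent" if score >= AGENT_CUTOFF else "not an agent")
```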
Oh, so it seems we need a coarse grained user (a vague enough physical realization of the user) for threshold problems to arise. I understand now, thank you again!
Thank you again for answering!