Some quick responses:
About the “lock-in” problem: I don’t think lock-in is a meaningful concern. I’m an agent with specific preferences, and I’m making decisions based on these preferences. The decision of which AI to run is just another decision. Hence, there’s no philosophical reason I shouldn’t make it based on my current preferences. The confusion that leads to viewing this as a problem comes from conflating long-term terminal preferences with short-term instrumental preferences based on some possibly erroneous beliefs. Notice also that in IBP, preferences can directly depend on computations, so they can be abstract / “meta” / indirect.
About “we’re not really checking which computations run, but which computations’ outputs the universe has information about… this is a problem to fix”. I don’t think it’s a problem. The former is the same as the latter. If, like in your example, P is a program that outputs a bit and Q is a program that outputs not-P, then P is running iff Q is running (as long as we’re at an epistemic vantage point that knows that Q = not-P). This seems intuitive to me: it makes no sense to distinguish between computing the millionth binary digit of pi and computing the negation of the millionth binary digit of pi. It’s essentially the same computation with a different representation of the result. To put it differently, if P is an uploaded human and Q is a different program which I know to be functionally equivalent to P, then Q is also considered to be an uploaded human. This is a philosophical commitment, but I consider it to be pretty reasonable.
About “the current framework only allows for our AGI to give positive value to computations”: yes, this is a major problem. There might be good answers to this, but currently all the candidate answers I know are pretty weird (i.e. require biting some philosophical bullet). I believe that we’ll understand more one way or the other as we progress in our mathematical inquiry.
About “a model of human cognitive science” and “pruning mechanisms”: I no longer believe these are necessary. I now think we don’t need to explicitly filter acausal attackers. Instead, in IBP every would-be mesa-optimizer is toothless because it automatically has to contain a simulation of the user, and therefore it (i) is a valid hypothesis from the user’s POV and (ii) induces the correct user preferences.
Thank you for taking the time to read and answer!
I understand you only care about maximizing your current preferences (which might include long-term flourishing of humanity), and not some vague “longtermist potential” independent of your preferences. I agree, but it would seem like most EAs would disagree (or maybe this point just hasn’t been driven home for them yet).
That’s interesting, thank you! I’ll give some thought to whether, even if this development holds, the massive search might have sneaked in an attack through some other avenue. Even without a coarse-grained simulated user, I don’t immediately see why some simulation hypotheses (maybe specifically tailored to the way in which the AI encodes its physical hypotheses) would not be able to alter the underlying physics in such a way as to provide a tighter causal loop between AI and simulator, so that User Detection yields a simulator. More concretely: a simulator might introduce microscopic variations in the simulation (affecting the AI’s perceptions) depending on its moment-to-moment behavior, and also perceive the AI’s outputs “even faster” than the simulated human user does (in the simulator’s world, maybe just by slowing down the simulation?).
Say P searches for a model of a theory T. Say Q simulates a room with a human, and a computer which delivers an electric shock to the human iff it finds a contradiction derived from T, and Q outputs whether the human screamed in pain (and suppose the human screams in pain iff they are shocked). Both programs reject at time t if they haven’t accepted by then, but suppose we know one of the two searches will finish before t.
I guess you will tell me “even if P = not-Q, the programs are not functionally equivalent, because the first carries more information (for instance, far more information can be computed from it if we rearrange what it chooses as output, or similarly peek into its computations)”. But where is the boundary drawn between “rearranging what the program outputs, or peeking into it, to extract more information which was already there” and “rearranging the program in a way that outputs information that wasn’t already there, or peeking and processing what we see to learn information that wasn’t already there”?
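To make the toy example concrete, here is a minimal sketch (the three-variable propositional “theory”, the exhaustive search, and all names are invented purely for illustration; the real argument is about arbitrary programs):

```python
# Toy sketch of the P/Q example above. For a finite propositional theory,
# "finding a contradiction" is modeled as exhausting the search for a model.
from itertools import product

# A toy "theory" T over boolean variables a[0], a[1], a[2].
T = [lambda a: a[0] or a[1],
     lambda a: (not a[0]) or a[2]]

def P():
    """Search for a model of T; accept (True) iff one is found."""
    for assignment in product([False, True], repeat=3):
        if all(clause(assignment) for clause in T):
            return True   # accepted: a model exists
    return False          # search exhausted: reject

def Q():
    """Simulate the room: the computer shocks the human iff it derives a
    contradiction from T (here: iff no model exists); the human screams iff
    shocked. Output: whether the human screamed."""
    contradiction_found = not any(
        all(clause(assignment) for clause in T)
        for assignment in product([False, True], repeat=3)
    )
    human_shocked = contradiction_found
    human_screamed = human_shocked  # stand-in for the simulated human's reaction
    return human_screamed

# As output bits, Q is just not-P...
assert Q() == (not P())
# ...but the question under discussion is which *other* computations run
# inside Q (the simulated human) beyond this single output bit.
```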
About most EAs disagreeing: yes, I think most EAs are confused about ethics (see e.g. 1 2 3), which is why I’m not sure whether I count as an EA or merely as “EA-adjacent”[1].
We design user detection so that anything below a threshold is a “user” (rather than only the extreme agent being the user), and if there are multiple (or no) “users” we discard the hypothesis. So, yes, there is still some filtering going on, just not as complex as before.
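Stated as a tiny sketch (the candidate scores, the names, and the threshold below are invented stand-ins for whatever measure User Detection actually uses; this is not the actual IBP/PreDCA definition):

```python
# Minimal sketch of the filtering rule described above: everything below the
# threshold counts as a "user"; zero or multiple "users" means the hypothesis
# is discarded. Lower score = more causally tight / more user-like (invented).
from typing import Dict, Optional

def detect_user(candidate_scores: Dict[str, float], threshold: float) -> Optional[str]:
    users = [name for name, score in candidate_scores.items() if score < threshold]
    if len(users) != 1:
        return None  # zero or multiple "users": discard the hypothesis
    return users[0]

# An attack hypothesis in which a simulator also sneaks under the threshold
# doesn't hand control to the simulator; it just gets discarded.
print(detect_user({"user": 0.3, "simulator": 0.6, "rock": 9.0}, threshold=1.0))  # None
print(detect_user({"user": 0.3, "rock": 9.0}, threshold=1.0))                    # 'user'
```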
But maybe this is not your main concern. You said “our best plan at avoiding those is applying various approximate pruning mechanisms (as already happens in many other Alignment proposals)”. This is not how I would put it. My goal is to have an algorithm whose theoretical guarantees we know (e.g. having such-and-such regret bound w.r.t. such-and-such prior). I believe that deep learning has theoretical guarantees; we just don’t know what they are. We will need to either (i) understand what guarantees DL has and how to modify it in order to impose the guarantees we actually want, or (ii) come up with a completely different algorithm that satisfies the new guarantees. Either will be challenging, but there are reasons to be cautiously optimistic (mainly the progress that’s already happening, and the fact that looking for algorithms is easier when you know exactly which mathematical property you need).
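Purely to illustrate the shape of guarantee I have in mind (a generic “regret bound w.r.t. a prior”, with schematic notation; not an actual IBP/PreDCA theorem):

```latex
% Schematic only: \pi is the learning algorithm, \mathcal{H} a hypothesis class
% with prior \zeta, T the time horizon, u_t the utility at time t, and \pi^*_H
% the optimal policy when hypothesis H is known in advance.
\[
  \forall H \in \mathcal{H}: \quad
  \mathrm{Reg}_T(\pi, H)
  \;=\; \mathbb{E}_H\!\left[\,\sum_{t=1}^{T} u_t \;\middle|\; \pi^*_H \right]
      - \mathbb{E}_H\!\left[\,\sum_{t=1}^{T} u_t \;\middle|\; \pi \right]
  \;\le\; C \sqrt{T \,\log\tfrac{1}{\zeta(H)}}
\]
```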
The difference is that if there’s actually a room with a human (or a simulation of a room with a human), then there are other computations running (all the outputs of the human throughout the process), not just the one bit about whether the human screamed in pain. That’s how we know that in this situation a human exists, whereas if we only have a computer running P then no human exists. We can’t just “rearrange the program in a way that outputs information that wasn’t already there”, because if it isn’t already there, the bridge transform will not assert this rearranged program is running.
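One way to picture the asymmetry (the sets below are invented stand-ins for what the bridge transform would assert about each hypothesis, just to show the shape of the argument):

```python
# Represent each hypothesis by the set of computations whose outputs the
# universe pins down under it (an invented stand-in for the bridge transform).
HUMAN_TRAJECTORY = {
    "human_heartbeat[0..t]",        # computations realized only if a human
    "human_inner_monologue[0..t]",  # is actually there (simulated or not)
    "human_scream_reflex",
}

hypothesis_bare_computer = {"P_output_bit"}                      # just a computer running P
hypothesis_room_with_human = {"Q_output_bit"} | HUMAN_TRAJECTORY

def contains_human(realized_computations):
    """A human exists under a hypothesis iff the human's computations are realized."""
    return HUMAN_TRAJECTORY <= realized_computations

print(contains_human(hypothesis_bare_computer))    # False
print(contains_human(hypothesis_room_with_human))  # True
```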
[1] Due to the unfortunate timing of this discussion, I feel the need to clarify: this has absolutely nothing to do with FTX. I would have said exactly the same before those recent events.
Thank you again for answering!
Yes, but simulators might not just “alter reality so that they are slightly more causally tight than the user”, they might even “alter reality so that they are inside the threshold and the user no longer is”, right? I guess that’s why you mention some filtering is still needed.
I understand now. I guess my point would then be restated as: given the amount of room that simulators (intuitively seem to) have to trick the AGI (even with all the above developments), it would seem like no training procedure implementing PreDCA can be modified/devised so as to achieve the guarantee of (almost surely) avoiding acausal attacks. Not because formal guarantees are impossible to prove about that training procedure (e.g. DL), but because pruning attacks from the space of hypotheses is too complicated a search for any human-made algorithm/procedure to carry out (because of the variety of attacks and the vastness of the space of hypotheses).
And about the bridge transform not asserting that a rearranged program is running: of course! I understand now, thank you.
No. The simulation needs to imitate the null hypothesis (what we understand as reality), otherwise it’s falsified. Therefore, it has to be computing every part of the null universe visible to the AI. In particular, it has to compute the AI responding to the user responding to the AI. So, it’s not possible for the attacker to make the user-AI loop less tight.
The variety of attacks doesn’t imply the impossibility of defending from them. In cryptography, we have protocols immune from all attacks[1] despite a vast space of possible attacks. Similarly, here I’m hoping to gradually transform the informal arguments above into a rigorous theorem (or well-supported conjecture) that the system is immune.
[1] As long as the assumptions of the model hold, ofc. And assuming some (highly likely) complexity-theoretic conjectures.
Yes, I had understood that, but this is only the case in the limit where the AI is completely certain about every minute detail of its immediate physical reality, right? Otherwise, as in my example above, the simulator could introduce microscopic variations (wherever the AI isn’t yet completely certain about reality, for instance in some parts of the user’s brain) which subtly alter reality in such a way that the information from counterfactual actions takes longer to travel between the AI and the user. Or am I missing something?
And about the variety of attacks not implying that defending against them is impossible: you’re right, thank you!
If the information takes a little longer to arrive, then the user will still be inside the threshold.
A more concerning problem is: what if the simulation only contains a coarse-grained simulation of the user, such that it doesn’t register as an agent? To account for this, we might need to define a notion of “coarse-grained agent” and allow such entities to be candidate users. Or maybe any coarse-grained agent has to be an actual agent with a similar loss function, in which case everything works out on its own. These are nuances that probably require uncovering more of the math to understand properly.
Oh, so it seems we need a coarse-grained user (a vague enough physical realization of the user) for threshold problems to arise. I understand now, thank you again!