Two more remarks.
User Detection
It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some of the pressure on the other classification criteria.
Given two agents G and H, we can ask which points on G’s timeline are in the causal past of which points on H’s timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G’s timeline, and measure the mutual information between this action (or actions) and H’s observations at some interval on H’s timeline.
Using this, we can effectively construct a future “causal cone” emanating from the AI’s origin, and also a past causal cone emanating from some time t on the AI’s timeline. Then, “nearby” agents will meet the intersection of these cones for low values of t whereas “faraway” agents will only meet it for high values of t or not at all. To first approximation, the user would be the “nearest” precursor[1] agent i.e. the one meeting the intersection for the minimal t.
More precisely, we expect the user’s observations to have nearly maximal mutual information with the AI’s actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI’s sensors measure every nerve signal emanating from the user’s brain? To address this, we can fix t to a value s.t. we expect only the user to meet the intersection of the cones, and have the AI select the agent which meets this intersection at the highest mutual-information threshold.
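As a purely illustrative toy of this procedure (none of the names or the discretized environment below come from the protocol; the `rollout` simulator is a stand-in for the counterfactual defined by the hypothesis), one can estimate the mutual information between a randomized action injected at time t and each candidate agent’s observations, and pick the candidate with the highest estimate:

```python
import numpy as np
from collections import Counter

def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired samples of discrete values."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        mi += p_xy * np.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def detect_user(rollout, candidates, t, n_samples=5000, actions=(0, 1, 2, 3)):
    """rollout(action, t) -> {candidate: discretized observation}; a stand-in for
    the counterfactual in which the AI takes `action` at time t."""
    acts, obs = [], {c: [] for c in candidates}
    for _ in range(n_samples):
        a = int(np.random.choice(actions))   # random counterfactual action at time t
        observed = rollout(a, t)
        acts.append(a)
        for c in candidates:
            obs[c].append(observed[c])
    # The nominal user is the candidate whose observations carry the most
    # information about the AI's randomized action.
    scores = {c: empirical_mutual_information(acts, obs[c]) for c in candidates}
    return max(scores, key=scores.get), scores
```

In a toy rollout where the would-be user sees the AI’s action directly while a bystander sees only a noisy function of it, the user’s score approaches the entropy of the action distribution (2 bits here) while the bystander’s stays well below it.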
This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.
More on Counterfactuals
In the parent post I suggested “instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe”. A possible elegant way to implement this:
1. Consider the entire portion ¯Θ of our (simplicity) prior which consists of coarsenings of Θ.
2. Apply the counterfactual to ¯Θ.
3. Renormalize the result from HUC to HUD.
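A classical-probability cartoon of these three steps, with ordinary distributions standing in for ultradistributions and a crude containment test standing in for “coarsening” (all names, the event-mask counterfactual, and the containment test are assumptions of this sketch, not part of the protocol):

```python
import numpy as np

def is_coarsening(candidate, theta, tol=1e-12):
    # Crude proxy: the candidate rules in every world that theta rules in,
    # i.e. it is no more informative than theta.
    return np.all(candidate[theta > tol] > tol)

def counterfactual_over_coarsenings(prior, theta, event):
    """prior: list of (weight, hypothesis) pairs, hypotheses as probability vectors
    (numpy arrays) over a finite set of worlds; event: boolean mask over worlds,
    standing in for the counterfactual."""
    # Step 1: the portion of the prior consisting of coarsenings of theta.
    coarse = [(w, h) for (w, h) in prior if is_coarsening(h, theta)]
    if not coarse:
        raise ValueError("no coarsenings of theta in the prior")
    # Step 2: apply the counterfactual, leaving sub-normalized "contributions"
    # (the classical shadow of the HUC stage).
    mixture = sum(w * np.where(event, h, 0.0) for (w, h) in coarse)
    # Step 3: renormalize back to a distribution (the HUC -> HUD step).
    total = mixture.sum()
    return mixture / total if total > 0 else mixture
```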
We still need precursor detection, otherwise the AI can create some new agent and make it the nominal “user”.
Causality in IBP
There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis Θ∈□(Γ×Φ), for Γ=Σ^R, we consider its bridge transform B∈□(Γ×2^Γ×Φ). Given some subset of programs Q⊆R we can define Δ:=Σ^Q then project B to B_Δ∈□(Γ×2^Δ)[1]. We can then take bridge transform again to get some C∈□(Γ×2^Γ×2^Δ). The 2^Γ factor now tells us which programs causally affect the manifestation of programs in Q. Notice that by Proposition 2.8 in the IBP article, when Q=R we just get all programs that are running, which makes sense.
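Schematically, writing Br for the bridge transform and leaving open which of the two projections from the footnote is used, the chain of operations reads:

```latex
\Theta \in \square(\Gamma \times \Phi),\quad \Gamma = \Sigma^R
  \;\xrightarrow{\;\mathrm{Br}\;}\;
B \in \square(\Gamma \times 2^{\Gamma} \times \Phi)
  \;\xrightarrow{\;\text{project, }\ \Delta := \Sigma^Q\;}\;
B_{\Delta} \in \square(\Gamma \times 2^{\Delta})
  \;\xrightarrow{\;\mathrm{Br}\;}\;
C \in \square(\Gamma \times 2^{\Gamma} \times 2^{\Delta})
```

The second application of Br treats 2^Δ as the “physical” factor, which is what lets its 2^Γ component be read as the programs that causally affect the manifestation of the programs in Q.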
Agreement Rules Out Mesa-Optimization
The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann’s) which informally amounts to the following: Given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] have similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don’t lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice and hence Alice must also consider this to be a plausible hypothesis.
If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user’s own posterior (i.e. the differences can be attributed to the AI’s superior reasoning). Therefore, the AI will act on the user’s behalf, leaving no room for mesa-optimizers. Any would-be mesa-optimizer has to take the shape of a hypothesis that the user should also believe, within which the pointer-to-values still points to the right place.
Two nuances:
Maybe in practice there’s still room for simulation hypotheses of the AI which contain coarse-grained simulations of the user. In this case, the user detection algorithm might need to allow for coarsely simulated agents.
If the agreement theorem needs condition v, we get a self-referential loop: if the AI and the user converge to the same utility function, the theorem guarantees that they converge to the same utility function, but otherwise it doesn’t. This might make the entire thing a useless tautology, or there might be a way to favorably resolve the self-reference, vaguely analogously to how Löb’s theorem allows resolving the self-reference in prisoner’s dilemma games between FairBots.
There are actually two ways to do this, corresponding to the two natural mappings Γ×2^Γ→Γ×2^Δ. The first is just projecting the subset of Γ to a subset of Δ; the second is analogous to what’s used in Proposition 2.16 of the IBP article. I’m not entirely sure what’s correct here.
Hi Vanessa! Thanks again for your previous answers. I’ve got one further concern.
Are all mesa-optimizers really only acausal attackers?
I think mesa-optimizers don’t need to be purely contained in a hypothesis (rendering them acausal attackers), but can instead be partly made up of the hypothesis-updating procedure itself (maybe this is obvious and you already considered it).
Of course, since the only way to change the AGI’s actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn’t need to be captured inside any hypothesis (which would make it easier to classify acausal attackers away).
That is, if we don’t think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if, more realistically, our AGI employs more complex heuristics to ever-better approximate the optimal hypothesis update, mesa-optimizers can be partially or completely encoded in those heuristics (put another way, those non-optimal methods can fail or be exploited). This failure could be seen as a capabilities failure (in the trivial sense that the AGI failed to correctly approximate perfect search), but I think it’s better understood as an alignment failure.
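A cartoon of this distinction (both update procedures and the scoring function below are invented for illustration; “score” stands in for whatever posterior or loss the protocol actually uses):

```python
def brute_force_update(hypotheses, score):
    # Exhaustive search: the only moving parts are the hypotheses themselves,
    # so any mesa-optimizer would have to live inside a hypothesis.
    return max(hypotheses, key=score)

def heuristic_update(hypotheses, score, proposer):
    # Realistic approximation: a learned `proposer` decides which hypotheses
    # even get scored. The proposer is machinery outside every hypothesis, and
    # a systematic bias in it is exactly the kind of place where an exploitable
    # flaw (or, on this worry, a mesa-optimizer) could sit.
    shortlisted = proposer(hypotheses)
    return max(shortlisted, key=score)
```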
The way I see PreDCA (and this might be where I’m wrong) is as an “outer top-level protocol” which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypothesis update (plus some trivial calculations over hypotheses to find the best action), and given that it does so correctly, we’re safe, since the outer objective we’ve provided is clearly aligned. That is, PreDCA is an outer objective that solves outer alignment. But we still need to ensure the hypothesis update is carried out correctly (and that’s everything our AGI is really doing).
I don’t think this realization rules out your Agreement solution, since if truly no hypothesis can steer the resulting actions in undesirable ways (maybe because every hypothesis with a user has the human as the user), then obviously not even optimizers in the hypothesis update can find malign hypotheses (although they could still mount causal attacks, e.g. by hacking the computer they’re running on). But I think your Agreement solution doesn’t completely rule out every undesirable hypothesis; it only makes it harder for an acausal attacker to make the user be someone other than the human. And in this situation, an optimizer in the hypothesis update could still select for malign hypotheses in which the human is subtly mis-modelled in just such a way that it has relevant consequences for the chosen actions. This can again be seen as a capabilities failure (not modelling the human well enough), but it will always be present to some degree, and it could be exploited by mesa-optimizers.
First, no, the AGI is not going to “employ complex heuristics to ever-better approximate the optimal hypothesis update”. The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).
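As a minimal illustration of the kind of guarantee being appealed to, here is textbook Bayesian consistency on a finite hypothesis class under realizability (a generic example, not the algorithm the protocol would actually use):

```python
import numpy as np

# Finite class of hypotheses about a coin's bias; the true bias is in the class
# (realizability). The posterior concentrates on the truth with high probability
# as data accumulates.
rng = np.random.default_rng(0)
hypotheses = np.array([0.1, 0.3, 0.5, 0.7, 0.9])     # candidate biases
true_bias = 0.7                                      # realizable: it is in the class
posterior = np.full(len(hypotheses), 1 / len(hypotheses))

for _ in range(2000):
    flip = rng.random() < true_bias
    likelihood = hypotheses if flip else 1 - hypotheses
    posterior = posterior * likelihood
    posterior /= posterior.sum()

print(np.round(posterior, 4))                        # mass piles up on 0.7
print(hypotheses[np.argmax(posterior)])              # MAP hypothesis ~ 0.7 w.h.p.
```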
Second, there’s the issue of non-cartesian attacks (“hacking the computer”). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won’t hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.
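A toy picture of that sweet spot (the decision rule and all numbers below are invented for illustration, not the actual IBP machinery): an overly pessimistic prior on side-effects makes the system refuse to run useful computations, while an informed prior that assigns hacking only a tiny probability lets it keep running them; whether that low probability is actually justified is exactly the misspecification worry.

```python
# Toy trade-off: run a computation on the envelope only if its expected value
# outweighs the prior-weighted harm of a catastrophic side-effect.
def should_run(value_of_result, p_side_effect, harm_if_side_effect):
    return value_of_result - p_side_effect * harm_if_side_effect > 0

print(should_run(value_of_result=1.0, p_side_effect=0.5, harm_if_side_effect=100))    # False: too afraid
print(should_run(value_of_result=1.0, p_side_effect=1e-6, harm_if_side_effect=100))   # True: informed prior
```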
Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What the exact assumptions are going to be, and what needs to be done to make sure they hold, is work for the future, ofc.
“The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability.”

I understand now; that was the main misunderstanding motivating my worries. This and your other two points have driven home for me the role mathematical guarantees play in the protocol, which I wasn’t contemplating. Thanks again for your kind answers!