Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the “evaluating agent” section of the article).
I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:
For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.
Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.
If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).
If there are no precursors like that, decide which of them are humans.
Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).
Detection
How to identify agents which are our agent’s precursors? Let our agent be G and let H be another agents which exists in the universe according to hypothesis Θ[2]. Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual ”H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).
A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it. It is possible that any such Θ would have a sufficiently large description complexity penalty that it doesn’t matter. On the second hand, if Θ is unconditionally Knightian uncertain about H creating G then the utility will be upper bounded by the scenario in which G doesn’t exist, which is liable to make Θ an effectively falsified hypothesis. On the third hand, it seems plausible that the creation of G by H would be contingent on G’s behavior (Newcomb-style, which we know how it works in infra-Bayesianism), in which case Θ is not falsified and the detection works. In any case, there is a possible variant of the definition to avoid the problem: instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe (in the hope that some such coarsening would leave the creation of G uncertain).
Notice that any agent whose existence is contingent on G’s policy cannot be detected as a precursor: the corresponding program doesn’t even “run”, because we don’t apply a G-policy-counterfactual to the bridge transform.
Classification
How to decide which precursors are which? One tool we have is the g parameter and the computational resource parameters in the definition of intelligence. In addition we might be able to create a very rough neuroscience-based model of humans. Also, we will hopefully have a lot of information about other AIs that can be relevant. Using these, it might be possible to create a rough benign/malign/irrelevant classifier, s.t.
Humans are classified as “benign”.
Most (by probability mass) malign simulation hypotheses contain at least one precursor classified as “malign”.
Non-human agents that exist in the causal past of our AI in the null (non-simulation) hypothesis are classified as “irrelevant”.
Assistance
Once we detected and classified precursors in each hypothesis, we discard all hypotheses that contain malign precursors. In the remaining hypotheses, we perform some kind of aggregation on the utility functions of the benign precursors (for example, this). The utility functions from different hypotheses are somehow normalized to form the overall utility function. Alternatively, we do a maximal lottery vote for the policy, where each hypothesis is a voter with weight proportional to its prior probability mass.
Inner Alignment
Why can this solve inner alignment? In any model-based approach, the AI doesn’t train the policy directly. Instead, it trains models and uses them to compute the policy. I suspect that the second step cannot create mesa-optimizers, since it only involves control and not learning[3]. Hence, any mesa-optimizer has to originate from the first step, i.e. from the model/hypothesis. And, any plausible physicalist hypothesis which contains a mesa-optimizer has to look like a malign simulation hypothesis.
Outer Alignment
Why can this solve outer alignment? Presumably, we are aggregating human utility functions. This doesn’t assume humans are perfect agents: g can be less than infinity. I suspect that when g<∞ the utility function becomes somewhat ambiguous, but the ambiguity can probably be resolved arbitrarily or maybe via a risk-averse method. What if the AI modifies the humans? Then only pre-modification humans are detected as precursors, and there’s no problem.
Moreover, the entire method can be combined with the Hippocratic principle to avoid catastrophic mistakes out of ignorance (i.e. to go from intent alignment to impact alignment).
We do need a lot more research to fully specify this “utility reconstruction” and check that it satisfies reasonable desiderata. But, the existence of a natural utility-function-dependent measure of intelligence suggests it is possible.
In modern deep RL systems, there might not be a clear line between learning and control. For example, if we use model-free RL to produce the policy for a given hypothesis, then there is learning happening there as well. In such an architecture, the value function or Q-function should be regarded as part of the hypothesis for our purpose.
Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual ”H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).
A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it.
Can you please explain how does this not match the definition? I don’t yet understand all the math, but intuitively, if H creates G / doesn’t interfere with the creation of G, then if H instead followed policy “do not create G/ do interfere with the creation of G”, then G’s code wouldn’t run?
Can you please give an example of a precursor that does match the definition?
The problem is that if Θ implies that H creates G but you consider a counterfactual in which H doesn’t create G then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it “hard counterfactuals”) only makes sense when the condition you’re counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining “soft counterfactuals” where you consider coarsenings of Θ in addition to Θ itself.
These are notoriously difficult to deal with. The only methods I know are that applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.
Weaknesses
My main concerns with this approach are:
The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent interactions in particular is required to gain sufficient confidence.
The feasibility of a good enough classifier. At present, I don’t have a concrete plan for attacking this, as it requires inputs from outside of computer science.
Inherent “incorrigibility”: once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won’t defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified so much that I’m not sure it is solved (rather than dissolved) even in the Book. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.
A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: this role of the source is merely telling the agent “where” it is in the universe. However, on closer examination, the physicalist g is far from purely behaviorist, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent’s interaction with the actual environment but also its interaction with the “envelope” computer. In a sense, the policy can be said to reflects the agent’s “conscious thoughts”.
This means that specifying an agent requires not only specifying its source code but also the “envelope semantics” C (possibly we also need to penalize for the complexity of C in the definition of g). Identifying that an agent exists requires not only that its source code is running, but also, at least that its history h is C-consistent with the α∈2Γ variable of the bridge transform. That is, for any y∈α we must have dCy for some destiny d⊐h. In other words, we want any computation the agents ostensibly runs on the envelope to be one that is physically manifest (it might be this condition isn’t sufficiently strong, since it doesn’t seem to establish a causal relation between the manifesting and the agent’s observations, but it’s at least necessary).
Notice also that the computational power of the envelope implied by C becomes another characteristic of the agent’s intelligence, together with g as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.
It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.
Given two agents G and H, which can ask which points on G‘s timeline are in the causal past of which points of H‘s timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G‘s timeline, and measure the mutual information between this action(s) and H‘s observations at some interval on H’s timeline.
Using this, we can effectively construct a future “causal cone” emanating from the AI’s origin, and also a past causal cone emanating from some time t on the AI’s timeline. Then, “nearby” agents will meet the intersection of these cones for low values of t whereas “faraway” agents will only meet it for high values of t or not at all. To first approximation, the user would be the “nearest” precursor[1] agent i.e. the one meeting the intersection for the minimal t.
More precisely, we expect the user’s observations to have nearly maximal mutual information with the AI’s actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI’s sensors measure every nerve signal emanating from the user’s brain? To address this, we can fix t to a value s.t. we expect only the user the meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.
This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.
More on Counterfactuals
In the parent post I suggested “instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe”. A possible elegant way to implement this:
Consider the entire portion ¯Θ of our (simplicity) prior which consists of coarsenings of Θ.
There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis Θ∈□(Γ×Φ), for Γ=ΣR, we consider its bridge transform B∈□(Γ×2Γ×Φ). Given some subset of programs Q⊆R we can define Δ:=ΣQ then project B to BΔ∈□(Γ×2Δ)[1]. We can then take bridge transform again to get some C∈□(Γ×2Γ×2Δ). The 2Γ factor now tells us which programs causally affect the manifestation of programs in Q. Notice that by Proposition 2.8 in the IBP article, when Q=R we just get all programs that are running, which makes sense.
Agreement Rules Out Mesa-Optimization
The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann’s) which informally amounts to the following: Given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don’t lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice and hence Alice must also consider this to be plausible hypothesis.
If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user’s own posterior (i.e. the differences can be attributed the AIs superior reasoning). Therefore, the AI will act on the user’s behalf, leaving no room for mesa-optimizers. Any would-be mesa-optimizer has to take the shape of a hypothesis that the user should also believe, within which the pointer-to-values still points to the right place.
Two nuances:
Maybe in practice there’s still room for simulation hypotheses of the AI which contain coarse-grained simulations of the user. In this case, the user detection algorithm might need to allow for coarsely simulated agents.
If the agreement theorem needs condition v, we get a self-referential loop: if the AI and the user converge to the same utility function, the theorem guarantees them to converge to the same utility function, but otherwise it doesn’t. This might make the entire thing a useless tautology, or there might be a way to favorably resolve the self-reference, vaguely analogously to how Loeb’s theorem allows resolving the self-reference in prisoner dilemma games between FairBots.
There are actually two ways to do this, corresponding to the two natural mappings Γ×2Γ→Γ×2Δ. The first is just projecting the subset of Γ to a subset of Δ, the second is analogous to what’s used in Proposition 2.16 of the IBP article. I’m not entirely sure what’s correct here.
Hi Vanessa! Thanks again for your previous answers. I’ve got one further concern.
Are all mesa-optimizers really only acausal attackers?
I think mesa-optimizers don’t need to be purely contained in a hypothesis (rendering them acausal attackers), but can be made up of a part of the hypotheses-updating procedures (maybe this is obvious and you already considered it).
Of course, since the only way to change the AGI’s actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn’t need to be captured inside any hypothesis (which would be easier for classifying acausal attackers away).
That is, if we don’t think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it’s better understood as an alignment failure.
The way I see PreDCA (and this might be where I’m wrong) is as an “outer top-level protocol” which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best action), and given it does that correctly, since the outer objective we’ve provided is clearly aligned, we’re safe. That is, PreDCA is an outer objective that solves outer alignment. But we still need to ensure the hypotheses update is carried out correctly (and that’s everything our AGI is really doing).
I don’t think this realization rules out your Agreement solution, since if truly no hypothesis can steer the resulting actions in undesirable ways (maybe because every hypothesis with a user has the human as the user), then obviously not even optimizers in hypothesis update can find malign hypotheses (although they can still causally attack hacking the computer they’re running on etc.). But I think your Agreement solution doesn’t completely rule out any undesirable hypothesis, but only makes it harder for an acausal attacker to have the user not be the human. And in this situation, an optimizer in hypothesis update could still select for malign hypotheses in which the human is subtly incorrectly modelled in such a precise way that has relevant consequences for the actions chosen. This can again be seen as a capabilities failure (not modelling the human well enough), but it will always be present to some degree, and it could be exploited by mesa-optimizers.
First, no, the AGI is not going to “employ complex heuristics to ever-better approximate optimal hypotheses update”. The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).
Second, there’s the issue of non-cartesian attacks (“hacking the computer”). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won’t hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.
Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What are the exact assumptions going to be and what needs to be done to make sure these assumptions hold is work for the future, ofc.
The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability.
I understand now, that was the main misunderstanding motivating my worries. This and your other two points have driven home for me the role mathematical guarantees play in the protocol, which I wasn’t contemplating. Thanks again for your kind answers!
There’s a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).
First, suppose that we are able formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn’t contain terms such as “oh btw don’t kill people while you’re building the nanosystem”. However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the “relative difficulty assumption” (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.
We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relatively to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relatively to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
Notice the hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.
AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form “send such-and-such innocuous packet to such-and-such IP address”. If the RDA is false and the AI succeeds gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, once the AI sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.
Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such risk is not necessarily a problem since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as “would be labeled by humans as such-and-such” instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models which puts more pressure on the RDA by making it easier to attack.
In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning out task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn’t help with learning the human-centric dataset, implying that the concept models don’t contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.
Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big “if” but one we need to deal with anyway).
Master post for alignment protocols.
Other relevant shortforms:
Autocalibrated quantilized debate
Hippocratic principle
IDA variants
Dialogic RL
More dialogic RL
Precursor Detection, Classification and Assistance (PreDCA)
Infra-Bayesian physicalism provides us with two key building blocks:
Given a hypothesis about the universe, we can tell which programs are running. (This is just the bridge transform.)
Given a program, we can tell whether it is an agent, and if so, which utility function it has[1] (the “evaluating agent” section of the article).
I will now outline how we can use these building blocks to solve both the inner and outer alignment problem. The rough idea is:
For each hypothesis in the prior, check which agents are precursors of our agent according to this hypothesis.
Among the precursors, check whether some are definitely neither humans nor animals nor previously created AIs.
If there are precursors like that, discard the hypothesis (it is probably a malign simulation hypothesis).
If there are no precursors like that, decide which of them are humans.
Follow an aggregate of the utility functions of the human precursors (conditional on the given hypothesis).
Detection
How to identify agents which are our agent’s precursors? Let our agent be G and let H be another agents which exists in the universe according to hypothesis Θ[2]. Then, H is considered to be a precursor of G in universe Θ when there is some H-policy σ s.t. applying the counterfactual ”H follows σ” to Θ (in the usual infra-Bayesian sense) causes G not to exist (i.e. its source code doesn’t run).
A possible complication is, what if Θ implies that H creates G / doesn’t interfere with the creation of G? In this case H might conceptually be a precursor, but the definition would not detect it. It is possible that any such Θ would have a sufficiently large description complexity penalty that it doesn’t matter. On the second hand, if Θ is unconditionally Knightian uncertain about H creating G then the utility will be upper bounded by the scenario in which G doesn’t exist, which is liable to make Θ an effectively falsified hypothesis. On the third hand, it seems plausible that the creation of G by H would be contingent on G’s behavior (Newcomb-style, which we know how it works in infra-Bayesianism), in which case Θ is not falsified and the detection works. In any case, there is a possible variant of the definition to avoid the problem: instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe (in the hope that some such coarsening would leave the creation of G uncertain).
Notice that any agent whose existence is contingent on G’s policy cannot be detected as a precursor: the corresponding program doesn’t even “run”, because we don’t apply a G-policy-counterfactual to the bridge transform.
Classification
How to decide which precursors are which? One tool we have is the g parameter and the computational resource parameters in the definition of intelligence. In addition we might be able to create a very rough neuroscience-based model of humans. Also, we will hopefully have a lot of information about other AIs that can be relevant. Using these, it might be possible to create a rough benign/malign/irrelevant classifier, s.t.
Humans are classified as “benign”.
Most (by probability mass) malign simulation hypotheses contain at least one precursor classified as “malign”.
Non-human agents that exist in the causal past of our AI in the null (non-simulation) hypothesis are classified as “irrelevant”.
Assistance
Once we detected and classified precursors in each hypothesis, we discard all hypotheses that contain malign precursors. In the remaining hypotheses, we perform some kind of aggregation on the utility functions of the benign precursors (for example, this). The utility functions from different hypotheses are somehow normalized to form the overall utility function. Alternatively, we do a maximal lottery vote for the policy, where each hypothesis is a voter with weight proportional to its prior probability mass.
Inner Alignment
Why can this solve inner alignment? In any model-based approach, the AI doesn’t train the policy directly. Instead, it trains models and uses them to compute the policy. I suspect that the second step cannot create mesa-optimizers, since it only involves control and not learning[3]. Hence, any mesa-optimizer has to originate from the first step, i.e. from the model/hypothesis. And, any plausible physicalist hypothesis which contains a mesa-optimizer has to look like a malign simulation hypothesis.
Outer Alignment
Why can this solve outer alignment? Presumably, we are aggregating human utility functions. This doesn’t assume humans are perfect agents: g can be less than infinity. I suspect that when g<∞ the utility function becomes somewhat ambiguous, but the ambiguity can probably be resolved arbitrarily or maybe via a risk-averse method. What if the AI modifies the humans? Then only pre-modification humans are detected as precursors, and there’s no problem.
Moreover, the entire method can be combined with the Hippocratic principle to avoid catastrophic mistakes out of ignorance (i.e. to go from intent alignment to impact alignment).
We do need a lot more research to fully specify this “utility reconstruction” and check that it satisfies reasonable desiderata. But, the existence of a natural utility-function-dependent measure of intelligence suggests it is possible.
I’m ignoring details like “what if H only exists with certain probability”. The more careful analysis is left for later.
In modern deep RL systems, there might not be a clear line between learning and control. For example, if we use model-free RL to produce the policy for a given hypothesis, then there is learning happening there as well. In such an architecture, the value function or Q-function should be regarded as part of the hypothesis for our purpose.
Can you please explain how does this not match the definition? I don’t yet understand all the math, but intuitively, if H creates G / doesn’t interfere with the creation of G, then if H instead followed policy “do not create G/ do interfere with the creation of G”, then G’s code wouldn’t run?
Can you please give an example of a precursor that does match the definition?
The problem is that if Θ implies that H creates G but you consider a counterfactual in which H doesn’t create G then you get an inconsistent hypothesis i.e. a HUC which contains only 0. It is not clear what to do with that. In other words, the usual way of defining counterfactuals in IB (I tentatively named it “hard counterfactuals”) only makes sense when the condition you’re counterfactualizing on is something you have Knightian uncertainty about (which seems safe to assume if this condition is about your own future action but not safe to assume in general). In a child post I suggested solving this by defining “soft counterfactuals” where you consider coarsenings of Θ in addition to Θ itself.
Thank you.
Some additional thoughts.
Non-Cartesian Daemons
These are notoriously difficult to deal with. The only methods I know are that applicable to other protocols are homomorphic cryptography and quantilization of envelope (external computer) actions. But, in this protocol, they are dealt with the same as Cartesian daemons! At least if we assume a non-Cartesian attack requires an envelope action, the malign hypotheses which are would-be sources of such actions are discarded without giving an opportunity for attack.
Weaknesses
My main concerns with this approach are:
The possibility of major conceptual holes in the definition of precursors. More informal analysis can help, but ultimately mathematical research in infra-Bayesian physicalism in general and infra-Bayesian cartesian/physicalist multi-agent interactions in particular is required to gain sufficient confidence.
The feasibility of a good enough classifier. At present, I don’t have a concrete plan for attacking this, as it requires inputs from outside of computer science.
Inherent “incorrigibility”: once the AI becomes sufficiently confident that it correctly detected and classified its precursors, its plans won’t defer to the users any more than the resulting utility function demands. On the second hand, I think the concept of corrigibility is underspecified so much that I’m not sure it is solved (rather than dissolved) even in the Book. Moreover, the concern can be ameliorated by sufficiently powerful interpretability tools. It is therefore desirable to think more of how to achieve interpretability in this context.
A question that often comes up in discussion of IRL: are agency and values purely behavioral concepts, or do they depend on how the system produces its behavior? The cartesian measure of agency I proposed seems purely behavioral, since it only depends on the policy. The physicalist version seems less so since it depends on the source code, but this difference might be minor: this role of the source is merely telling the agent “where” it is in the universe. However, on closer examination, the physicalist g is far from purely behaviorist, and this is true even for cartesian Turing RL. Indeed, the policy describes not only the agent’s interaction with the actual environment but also its interaction with the “envelope” computer. In a sense, the policy can be said to reflects the agent’s “conscious thoughts”.
This means that specifying an agent requires not only specifying its source code but also the “envelope semantics” C (possibly we also need to penalize for the complexity of C in the definition of g). Identifying that an agent exists requires not only that its source code is running, but also, at least that its history h is C-consistent with the α∈2Γ variable of the bridge transform. That is, for any y∈α we must have dCy for some destiny d⊐h. In other words, we want any computation the agents ostensibly runs on the envelope to be one that is physically manifest (it might be this condition isn’t sufficiently strong, since it doesn’t seem to establish a causal relation between the manifesting and the agent’s observations, but it’s at least necessary).
Notice also that the computational power of the envelope implied by C becomes another characteristic of the agent’s intelligence, together with g as a function of the cost of computational resources. It might be useful to come up with natural ways to quantify this power.
Here’s a video of a talk I gave about PreDCA.
Two more remarks.
User Detection
It can be useful to identify and assist specifically the user rather than e.g. any human that ever lived (and maybe some hominids). For this purpose I propose the following method. It also strengthens the protocol by relieving some pressure from other classification criteria.
Given two agents G and H, which can ask which points on G‘s timeline are in the causal past of which points of H‘s timeline. To answer this, consider the counterfactual in which G takes a random action (or sequence of actions) at some point (or interval) on G‘s timeline, and measure the mutual information between this action(s) and H‘s observations at some interval on H’s timeline.
Using this, we can effectively construct a future “causal cone” emanating from the AI’s origin, and also a past causal cone emanating from some time t on the AI’s timeline. Then, “nearby” agents will meet the intersection of these cones for low values of t whereas “faraway” agents will only meet it for high values of t or not at all. To first approximation, the user would be the “nearest” precursor[1] agent i.e. the one meeting the intersection for the minimal t.
More precisely, we expect the user’s observations to have nearly maximal mutual information with the AI’s actions: the user can e.g. see every symbol the AI outputs to the display. However, the other direction is less clear: can the AI’s sensors measure every nerve signal emanating from the user’s brain? To address this, we can fix t to a value s.t. we expect only the user the meet the intersection of cones, and have the AI select the agent which meets this intersection for the highest mutual information threshold.
This probably does not make the detection of malign agents redundant, since AFAICT a malign simulation hypothesis might be somehow cleverly arranged to make a malign agent the user.
More on Counterfactuals
In the parent post I suggested “instead of examining only Θ we also examine coarsenings of Θ which are not much more complex to describe”. A possible elegant way to implement this:
Consider the entire portion ¯Θ of our (simplicity) prior which consists of coarsenings of Θ.
Apply the counterfactual to ¯Θ.
Renormalize the result from HUC to HUD.
We still need precursor detection, otherwise the AI can create some new agent and make it the nominal “user”.
Causality in IBP
There seems to be an even more elegant way to define causal relationships between agents, or more generally between programs. Starting from a hypothesis Θ∈□(Γ×Φ), for Γ=ΣR, we consider its bridge transform B∈□(Γ×2Γ×Φ). Given some subset of programs Q⊆R we can define Δ:=ΣQ then project B to BΔ∈□(Γ×2Δ)[1]. We can then take bridge transform again to get some C∈□(Γ×2Γ×2Δ). The 2Γ factor now tells us which programs causally affect the manifestation of programs in Q. Notice that by Proposition 2.8 in the IBP article, when Q=R we just get all programs that are running, which makes sense.
Agreement Rules Out Mesa-Optimization
The version of PreDCA without any explicit malign hypothesis filtering might be immune to malign hypotheses, and here is why. It seems plausible that IBP admits an agreement theorem (analogous to Aumann’s) which informally amounts to the following: Given two agents Alice and Bobcat that (i) share the same physical universe, (ii) have a sufficiently tight causal relationship (each can see what the other sees), (iii) have unprivileged locations inside the physical universe, (iv) start from similar/compatible priors and (v) [maybe needed?] similar utility functions, they converge to similar/compatible beliefs, regardless of the complexity of translation between their subjective viewpoints. This is plausible because (i) as opposed to the cartesian framework, different bridge rules don’t lead to different probabilities and (ii) if Bobcat considers a simulation hypothesis plausible, and the simulation is sufficiently detailed to fool it indefinitely, then the simulation contains a detailed simulation of Alice and hence Alice must also consider this to be plausible hypothesis.
If the agreement conjecture is true, then the AI will converge to hypotheses that all contain the user, in a causal relationship with the AI that affirms them as the user. Moreover, those hypotheses will be compatible with the user’s own posterior (i.e. the differences can be attributed the AIs superior reasoning). Therefore, the AI will act on the user’s behalf, leaving no room for mesa-optimizers. Any would-be mesa-optimizer has to take the shape of a hypothesis that the user should also believe, within which the pointer-to-values still points to the right place.
Two nuances:
Maybe in practice there’s still room for simulation hypotheses of the AI which contain coarse-grained simulations of the user. In this case, the user detection algorithm might need to allow for coarsely simulated agents.
If the agreement theorem needs condition v, we get a self-referential loop: if the AI and the user converge to the same utility function, the theorem guarantees them to converge to the same utility function, but otherwise it doesn’t. This might make the entire thing a useless tautology, or there might be a way to favorably resolve the self-reference, vaguely analogously to how Loeb’s theorem allows resolving the self-reference in prisoner dilemma games between FairBots.
There are actually two ways to do this, corresponding to the two natural mappings Γ×2Γ→Γ×2Δ. The first is just projecting the subset of Γ to a subset of Δ, the second is analogous to what’s used in Proposition 2.16 of the IBP article. I’m not entirely sure what’s correct here.
Hi Vanessa! Thanks again for your previous answers. I’ve got one further concern.
Are all mesa-optimizers really only acausal attackers?
I think mesa-optimizers don’t need to be purely contained in a hypothesis (rendering them acausal attackers), but can be made up of a part of the hypotheses-updating procedures (maybe this is obvious and you already considered it).
Of course, since the only way to change the AGI’s actions is by changing its hypotheses, even these mesa-optimizers will have to alter hypothesis selection. But their whole running program doesn’t need to be captured inside any hypothesis (which would be easier for classifying acausal attackers away).
That is, if we don’t think about how the AGI updates its hypotheses, and just consider them magically updating (without any intermediate computations), then of course, the only mesa-optimizers will be inside hypotheses. If we actually think about these computations and consider a brute-force search over all hypotheses, then again they will only be found inside hypotheses, since the search algorithm itself is too simple and provides no further room for storing a subagent (even if the mesa-optimizer somehow takes advantage of the details of the search). But if more realistically our AGI employs more complex heuristics to ever-better approximate optimal hypotheses update, mesa-optimizers can be partially or completely encoded in those (put another way, those non-optimal methods can fail / be exploited). This failure could be seen as a capabilities failure (in the trivial sense that it failed to correctly approximate perfect search), but I think it’s better understood as an alignment failure.
The way I see PreDCA (and this might be where I’m wrong) is as an “outer top-level protocol” which we can fit around any superintelligence of arbitrary architecture. That is, the superintelligence will only have to carry out the hypotheses update (plus some trivial calculations over hypotheses to find the best action), and given it does that correctly, since the outer objective we’ve provided is clearly aligned, we’re safe. That is, PreDCA is an outer objective that solves outer alignment. But we still need to ensure the hypotheses update is carried out correctly (and that’s everything our AGI is really doing).
I don’t think this realization rules out your Agreement solution, since if truly no hypothesis can steer the resulting actions in undesirable ways (maybe because every hypothesis with a user has the human as the user), then obviously not even optimizers in hypothesis update can find malign hypotheses (although they can still causally attack hacking the computer they’re running on etc.). But I think your Agreement solution doesn’t completely rule out any undesirable hypothesis, but only makes it harder for an acausal attacker to have the user not be the human. And in this situation, an optimizer in hypothesis update could still select for malign hypotheses in which the human is subtly incorrectly modelled in such a precise way that has relevant consequences for the actions chosen. This can again be seen as a capabilities failure (not modelling the human well enough), but it will always be present to some degree, and it could be exploited by mesa-optimizers.
First, no, the AGI is not going to “employ complex heuristics to ever-better approximate optimal hypotheses update”. The AGI is going to be based on an algorithm which, as a mathematical fact (if not proved then at least conjectured), converges to the correct hypothesis with high probability. Just like we can prove that e.g. SVMs converge to the optimal hypothesis in the respective class, or that particular RL algorithms for small MDPs converge to the correct hypothesis (assuming realizability).
Second, there’s the issue of non-cartesian attacks (“hacking the computer”). Assuming that the core computing unit is not powerful enough to mount a non-cartesian attack on its own, such attacks can arguably be regarded as detrimental side-effects of running computations on the envelope. My hope is that we can shape the prior about such side-effects in some informed way (e.g. the vast majority of programs won’t hack the computer) s.t. we still have approximate learnability (i.e. the system is not too afraid to run computations) without misspecification (i.e. the system is not overconfident about the safety of running computations). The more effort we put into hardening the system, the easier it should be to find such a sweet spot.
Third, I hope that the agreement solution will completely rule out any undesirable hypothesis, because we will have an actual theorem that guarantees it. What are the exact assumptions going to be and what needs to be done to make sure these assumptions hold is work for the future, ofc.
I understand now, that was the main misunderstanding motivating my worries. This and your other two points have driven home for me the role mathematical guarantees play in the protocol, which I wasn’t contemplating. Thanks again for your kind answers!
There’s a class of AI risk mitigation strategies which relies on the users to perform the pivotal act using tools created by AI (e.g. nanosystems). These strategies are especially appealing if we want to avoid human models. Here is a concrete alignment protocol for these strategies, closely related to AQD, which we call autocalibrating quantilized RL (AQRL).
First, suppose that we are able formulate the task as episodic RL with a formally specified reward function. The reward function is necessarily only a proxy for our true goal, since it doesn’t contain terms such as “oh btw don’t kill people while you’re building the nanosystem”. However, suppose the task is s.t. accomplishing it in the intended way (without Goodharting or causing catastrophic side effects) is easier than performing any attack. We will call this the “relative difficulty assumption” (RDA). Then, there exists a value for the quantilization parameter s.t. quantilized RL performs the task in the intended way.
We might not know how to set the quantilization parameter on our own, but we can define a performance goal for the task (in terms of expected total reward) s.t. the RDA holds. This leads to algorithms which gradually tune the quantilization parameter until the performance goal is met, while maintaining a proper balance between safety and sample complexity. Here it is important to keep track of epistemic vs. aleatoric uncertainty: the performance goal is the expectation of total reward relatively to aleatoric uncertainty (i.e. the stochasticity of a given hypothesis), whereas the safety goal is a bound on the expected cost of overshooting the optimal quantilization parameter relatively to both aleatoric and epistemic uncertainty (i.e. uncertainty between different hypotheses). This secures the system against malign hypotheses that are trying to cause an overshoot.
Notice the hardening the system (i.e. making it more secure using conventional methods) increases the likelihood of the RDA. That is, although it is infeasible to manually harden the system against superintelligent attacks, it can very well be feasible to harden it against attacks available at the minimal level of optimization power (= quantilization parameter) required to meet the performance goal, at least to such an extent that the intended solution becomes more appealing for the AI.
AQRL involves the risk that the RDA is false, even though it seems true to the designers. To mitigate this risk, we can deploy honeypots. That is, we add auxiliary goals to our reward function s.t. (i) if the goal is reached, maximal total reward is awarded (ii) the goal is harmless in itself (iii) reaching the goal is still harder than the intended solution, but by a smaller margin than performing attacks (iv) we can detect the event of reaching the goal. For example, suppose that the AI is not supposed to have Internet access, so gaining Internet access is a path to a variety of possible attacks. Then, we can add an auxiliary goal of the form “send such-and-such innocuous packet to such-and-such IP address”. If the RDA is false and the AI succeeds gaining Internet access earlier (on the quantilization parameter scale) than reaching the performance goal, once the AI sent the packet it will randomize all its remaining actions (since it is operating at the earliest quantilization parameter which allows this). The users will then be alerted that security was breached and rethink their next steps.
Now, suppose that we can only specify the reward function via concepts that have to be learned from examples. This opens the risk of generalization failure. Such risk is not necessarily a problem since we are still protected by quantilization. However, it needs to be factored into our safety bounds. An especially concerning possibility is that the AI will interpret the concepts as “would be labeled by humans as such-and-such” instead of the intended interpretation, creating incentives to deceive. Moreover, this encourages learning human models which puts more pressure on the RDA by making it easier to attack.
In order to avoid learning human models, we can use methods along these lines. Specifically, suppose that, in addition to the dataset for learning out task-relevant concepts, we supply a dataset designed specifically for learning human models. We can then design an algorithm that produces models of the task-relevant concepts s.t. the human-centric dataset cannot be compressed w.r.t. them. That is, the algorithm makes sure that access to the models it produces doesn’t help with learning the human-centric dataset, implying that the concept models don’t contain human models. Moreover, we can extend this approach to the entire process, s.t. human models are not learned from any other source of information as well.
Notice that although this method excludes human models from the actual RL process, it encourages them in intermediate computations, since we are actively supplying a human-centric dataset to the system. But, this is not dangerous if the intermediate computation is hardened against non-Cartesian daemons (a big “if” but one we need to deal with anyway).