This post is an excellent distillation of a cluster of past work on the malignness of Solomonoff Induction, which has become a foundational argument/model for inner agency and malign models more generally.
I’ve long thought that the malignness argument overlooks some major counterarguments, but I never got around to writing them up. Now that this post is up for the 2020 review, it seems like a good time to walk through them.
In Solomonoff Model, Sufficiently Large Data Rules Out Malignness
There is a major outside-view reason to expect that the Solomonoff-is-malign argument must be doing something fishy: Solomonoff Induction (SI) comes with performance guarantees. In the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. The post mentions that:
A simple application of the no free lunch theorem shows that there is no way of making predictions that is better than the Solomonoff prior across all possible distributions over all possible strings. Thus, agents that are influencing the Solomonoff prior cannot be good at predicting, and thus gain influence, in all possible worlds.
… but in the large-data limit, SI’s guarantees are stronger than just that. In the large-data limit, there is no computable way of making better predictions than the Solomonoff prior in any world. Thus, agents that are influencing the Solomonoff prior cannot gain long-term influence in any computable world; they have zero degrees of freedom to use for influence. It does not matter if they specialize in influencing worlds in which they have short strings; they still cannot use any degrees of freedom for influence without losing all their influence in the large-data limit.
Takeaway of this argument: as long as we throw enough data at our Solomonoff inductor before asking it for any outputs, the malign agent problem must go away. (Though note that we never know exactly how much data that is; all we have is a big-O argument with an uncomputable constant.)
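To see the mechanism concretely, here is a hypothetical toy model (an ordinary Bayesian mixture over two hand-written predictors, not actual Solomonoff induction; all numbers are arbitrary): a hypothesis that spends even a small fraction of its predictions on “influence” pays a log-likelihood cost on every such step, so in the large-data limit it is dominated no matter how large its prior head start.

```python
import math
import random

random.seed(0)

def truth_prob(bit):
    # Honest hypothesis: knows the true distribution (bits are 1 w.p. 0.7).
    return 0.7 if bit == 1 else 0.3

def malign_prob(bit, t):
    # Malign hypothesis: predicts honestly except on every 100th step,
    # where it asserts the wrong outcome -- "spending" likelihood on influence.
    if t % 100 == 0:
        return 1.0 - truth_prob(bit)
    return truth_prob(bit)

# Log prior weights: give the malign hypothesis a huge head start,
# as if its program were much shorter.
log_w_truth, log_w_malign = -50.0, -2.0

for t in range(1, 100_001):
    bit = 1 if random.random() < 0.7 else 0
    log_w_truth += math.log(truth_prob(bit))
    log_w_malign += math.log(malign_prob(bit, t))

# Each "influence" step costs the malign hypothesis log-likelihood it never
# recovers, so with enough data the honest hypothesis dominates regardless
# of the prior gap.
print(log_w_truth > log_w_malign)  # True
```

The malign hypothesis loses roughly 0.34 nats of log-likelihood per defection step in expectation; over 1,000 such steps that dwarfs its 48-nat prior advantage. This is the "zero degrees of freedom" point in miniature.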
… but then how the hell does this outside-view argument jibe with all the inside-view arguments about malign agents in the prior?
Reflection Breaks The Large-Data Guarantees
There’s an important gotcha in those guarantees: in the limit of large data, SI performs as well as the best-predicting program, in every computably-generated world. SI itself is not computable, therefore the guarantees do not apply to worlds which contain more than a single instance of Solomonoff induction, or worlds whose behavior depends on the Solomonoff inductor’s outputs.
One example of this is AIXI (basically a Solomonoff inductor hooked up to a reward learning system): because AIXI’s future data stream depends on its own present actions, the SI guarantees break down; takeover by a malign agent in the prior is no longer blocked by the SI guarantees.
Predict-O-Matic is a similar example: that story depends on the potential for self-fulfilling prophecies, which requires that the world’s behavior depend on the predictor’s output.
We could also break the large-data guarantees by making a copy of the Solomonoff inductor, using the copy to predict what the original will predict, and then choosing outcomes so that the original inductor’s guesses are all wrong. Then any random program will outperform the inductor’s predictions. But again, this environment itself contains a Solomonoff inductor, so it’s not computable; it’s no surprise that the guarantees break.
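A toy version of this diagonalization (hypothetical code; the particular predictor is an arbitrary stand-in): any environment that can run a copy of a deterministic predictor can simply invert its guesses, driving its accuracy to zero while a fair coin would get 50%.

```python
def predictor(history):
    # Stand-in for any deterministic predictor: guess the minority bit so far.
    return 0 if history.count(1) > len(history) / 2 else 1

history = []
correct = 0
for _ in range(1000):
    guess = predictor(history)
    outcome = 1 - guess       # the environment inverts the copy's prediction
    correct += (guess == outcome)
    history.append(outcome)

print(correct)  # 0 -- the diagonalizing environment defeats the predictor
```

Nothing here depends on which predictor you plug in: the environment always wins because it moves second, which is exactly why such environments fall outside the class the SI guarantees cover.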
(Interesting technical side question: this sort of reflection issue is exactly the sort of thing Logical Inductors were made for. Does the large-data guarantee of SI generalize to Logical Inductors in a way which handles reflection better? I do not know the answer.)
If Reflection Breaks The Guarantees, Then Why Does This Matter?
The real world does in fact contain lots of agents, and real-world agents’ predictions do in fact influence the world’s behavior. So presumably (allowing for uncertainty about this handwavy argument) the malignness of the Solomonoff prior should carry over to realistic use-cases, right? So why does this tangent matter in the first place?
Well, it matters because we’re left with an importantly different picture: malignness is not a property of SI itself, so much as a property of SI in specific environments. Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large-data guarantees show that much. We need specific external conditions—like feedback loops or other agents—in order for malignness to kick in. Colloquially speaking, it is not strictly an “inner” problem; it is a problem which depends heavily on the “outer” conditions.
If we think of malignness of SI just in terms of malign inner agents taking over, as in the post, then the problem seems largely decoupled from the specifics of the objective (i.e. accurate prediction) and environment. If that were the case, then malign inner agents would be a very neatly-defined subproblem of alignment—a problem which we could work on without needing to worry about alignment of the outer objective or reflection or embeddedness in the environment. But unfortunately the problem does not cleanly factor like that; the large-data guarantees and their breakdown show that malignness of SI is very tightly coupled to outer alignment and reflection and embeddedness and all that.
Now for one stronger claim. We don’t need malign inner agent arguments to conclude that SI handles reflection and embeddedness poorly; we already knew that. Reflection and embedded world-models are already problems in need of solving, for many different reasons. The fact that malign agents in the hypothesis space are relevant for SI only in cases where we already knew SI breaks suggests that this kind of malign inner agent is not a subproblem which we need to worry about in its own right. Indeed, I expect that once we have good ways of handling reflection and embeddedness in general, the problem of malign agents in the hypothesis space will go away on its own. (Infra-Bayesianism might be a case in point, though I haven’t studied it enough myself to be confident in that.)
Merely having malign agents in the hypothesis space is not enough for the malign agents to take over in general; the large data guarantees show that much
It seems like you can get malign behavior if you assume:
There are some important decisions on which you can’t get feedback.
There are malign agents in the prior who can recognize those decisions.
In that case the malign agents can always defect only on important decisions where you can’t get feedback.
I agree that if you can get feedback on all important decisions (and actually have time to recover from a catastrophe after getting the feedback) then malignness of the universal prior isn’t important.
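The loophole described above can be made concrete with a toy Bayesian update (hypothetical code; the numbers are arbitrary): a hypothesis that defects only on decisions which never produce an observation pays no likelihood penalty at all, so its posterior weight never decays.

```python
import math

# Two hypotheses with equal prior weight. The "malign" one agrees with the
# truthful one on every observation we can score; all of its defection is
# reserved for decisions that generate no feedback.
log_w_truth = log_w_malign = math.log(0.5)

for t in range(10_000):
    feedback = (t % 100 != 0)          # 1 in 100 decisions gets no feedback
    if feedback:
        # Both hypotheses assign the observed bit probability 0.9.
        log_w_truth += math.log(0.9)
        log_w_malign += math.log(0.9)  # identical likelihood: no penalty
    # On no-feedback steps the malign hypothesis recommends a bad action,
    # but there is no observation, hence no Bayesian update at all.

print(math.exp(log_w_truth - log_w_malign))  # 1.0: posterior ratio unchanged
```

No amount of additional data of this kind separates the two hypotheses, which is why the large-data guarantee offers no protection on the unscored decisions.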
I don’t have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven’t thought about it carefully. For example, if you replace Solomonoff induction with a reflective oracle, it seems like you have an identical problem; does that seem right to you? And similarly, it seems like a creature who uses mathematical reasoning to estimate features of the universal prior would be vulnerable to similar pathologies even in a universe that is computable.
ETA: that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.
I don’t have a clear picture of how handling embeddedness or reflection would make this problem go away, though I haven’t thought about it carefully.
Infra-Bayesian physicalism does ameliorate the problem by handling “embeddedness”. Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn’t get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.
that all said I agree that the malignness of the universal prior is unlikely to be very important in realistic cases, and the difficulty stems from a pretty messed up situation that we want to avoid for other reasons. Namely, you want to avoid being so much weaker than agents inside of your prior.
Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the “messed up situation”?
Infra-Bayesian physicalism does ameliorate the problem by handling “embeddedness”. Specifically, it ameliorates it by removing the need to have bridge rules in your hypotheses. This doesn’t get rid of malign hypotheses entirely, but it does mean they no longer have an astronomical advantage in complexity over the true hypothesis.
I agree that removing bridge hypotheses removes one of the advantages for malign hypotheses. I didn’t mention this because it doesn’t seem like the way in which John is using “embeddedness”; for example, it seems orthogonal to the way in which the situation violates the conditions for Solomonoff induction to be eventually correct. I’d stand by saying that it doesn’t appear to make the problem go away.
That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses (since otherwise they also get big benefits from the influence update). And then once you’ve done that in a sensible way, it seems like it also addresses any issues with embeddedness (though maybe we just want to say that those are being solved inside the decision theory). If you want to recover the expected behavior of induction as a component of intelligent reasoning (rather than a component of the utility function + an instrumental step in intelligent reasoning), then it seems like you need a rather different tack.
Can you elaborate on this? Why is it unlikely in realistic cases, and what other reason do we have to avoid the “messed up situation”?
If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources. If you do the same induction but just remove the malign hypotheses, then it seems like you are even dumber, and the problem is even worse viewed from the competitiveness perspective.
I’d stand by saying that it doesn’t appear to make the problem go away.
Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.
That said, it seems to me like you basically need to take a decision-theoretic approach to have any hope of ruling out malign hypotheses
I’m not sure I understand what you mean by “decision-theoretic approach”. This attack vector has structure similar to acausal bargaining (between the AI and the attacker), so plausibly some decision theories that block acausal bargaining can rule out this as well. Is this what you mean?
If your inductor actually finds and runs a hypothesis much smarter than you, then you are doing a terrible job of using your resources, since you are trying to be ~as smart as you can using all of the available resources.
This seems wrong to me. The inductor doesn’t literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn’t imply any wastefulness.
Sure. But it becomes much more amenable to methods such as confidence thresholds, which are applicable to some alignment protocols at least.
It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don’t work once malign hypotheses have >99.9999999% of probability, so you need to ensure that the benign hypothesis’s description is within ~30 bits of the malign hypothesis), and embeddedness alone isn’t enough to get you there.
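The “30 bits” figure follows from ordinary description-length arithmetic (a back-of-the-envelope sketch, nothing specific to the universal prior): each extra bit of description length halves a hypothesis’s prior weight, so a 30-bit gap corresponds to roughly one part in a billion.

```python
# k extra description bits cost a factor of 2^-k in prior weight.
k = 30
prior_ratio = 2.0 ** -k

# A 30-bit deficit puts the benign hypothesis below one part in a billion
# of the (otherwise equally-predictive) mixture -- i.e. malign hypotheses
# at >99.9999999%.
print(prior_ratio < 1e-9)  # True
```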
I’m not sure I understand what you mean by “decision-theoretic approach”
I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally, because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences (and if you try to define utility in terms of Solomonoff induction applied to your experiences, e.g. by learning a human, then it seems vulnerable to attack again, bridge hypotheses or no).
This seems wrong to me. The inductor doesn’t literally simulate the attacker. It reasons about the attacker (using some theory of metacosmology) and infers what the attacker would do, which doesn’t imply any wastefulness.
I agree that the situation is better when Solomonoff induction is something you are reasoning about, rather than an approximate description of your reasoning. In that case it’s not completely pathological, but it still seems bad, in a similar way, to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).
It seems like you have to get close to eliminating malign hypotheses in order to apply such methods (i.e. they don’t work once malign hypotheses have >99.9999999% of probability, so you need to ensure that the benign hypothesis’s description is within ~30 bits of the malign hypothesis), and embeddedness alone isn’t enough to get you there.
Why is embeddedness not enough? Once you don’t have bridge rules, what is left is the laws of physics. What does the malign hypothesis explain about the laws of physics that the true hypothesis doesn’t explain?
I suspect (but don’t have a proof or even a theorem statement) that IB physicalism produces some kind of agreement theorem for different agents within the same universe, which would guarantee that the user and the AI should converge to the same beliefs (provided that both of them follow IBP).
I mean that you have some utility function, are choosing actions based on E[utility|action], and perform Solomonoff induction only instrumentally, because it suggests ways in which your own decision is correlated with utility. There is still something like the universal prior in the definition of utility, but it no longer cares at all about your particular experiences...
I’m not sure I follow your reasoning, but IBP sort of does that. In IBP we don’t have subjective expectations per se, only an equation for how to “updatelessly” evaluate different policies.
I agree that the situation is better when Solomonoff induction is something you are reasoning about, rather than an approximate description of your reasoning. In that case it’s not completely pathological, but it still seems bad, in a similar way, to reason about the world by reasoning about other agents reasoning about the world (rather than by directly learning the lessons that those agents have learned and applying those lessons in the same way that those agents would apply them).
Okay, but suppose that the AI has real evidence for the simulation hypothesis (evidence that we would consider valid). For example, suppose that there is some metacosmological explanation for the precise value of the fine structure constant (not in the sense of, this is the value which supports life, but in the sense of, this is the value that simulators like to simulate). Do you agree that in this case it is completely rational for the AI to reason about the world via reasoning about the simulators?
I’m not sure I follow your reasoning, but IBP sort of does that. In IBP we don’t have subjective expectations per se, only an equation for how to “updatelessly” evaluate different policies.
It seems like any approach that evaluates policies based on their consequences is fine, isn’t it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.
I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent. It seems like we are discussing a version that defines values differently, but where neither agent uses Solomonoff induction directly. Is that right?
It seems like any approach that evaluates policies based on their consequences is fine, isn’t it? That is, malign hypotheses dominate the posterior for my experiences, but not for things I consider morally valuable.
Why? Maybe you’re thinking of UDT? In which case, it’s sort of true but IBP is precisely a formalization of UDT + extra nuance regarding the input of the utility function.
I may just not be understanding the proposal for how the IBP agent differs from the non-IBP agent.
Well, IBP is explained here. I’m not sure what kind of non-IBP agent you’re imagining.
I like the feedback framing, it seems to get closer to the heart-of-the-thing than my explanation did. It makes the role of the pointers problem and latent variables more clear, which in turn makes the role of outer alignment more clear. When writing my review, I kept thinking that it seemed like reflection and embeddedness and outer alignment all needed to be figured out to deal with this kind of malign inner agent, but I didn’t have a good explanation for the outer alignment part, so I focused mainly on reflection and embeddedness.
That said, I think the right frame here involves “feedback” in a more general sense than I think you’re imagining it. In particular, I don’t think catastrophes are very relevant.
The role of “feedback” here is mainly informational; it’s about the ability to tell which decision is correct. The thing-we-want from the “feedback” is something like the large-data guarantee from SI: we want to be able to train the system on a bunch of data before asking it for any output, and we want that training to wipe out the influence of any malign agents in the hypothesis space. If there’s some class of decisions where we can’t tell which decision is correct, and a malign inner agent can recognize that class, then presumably we can’t create the training data we need.
With that picture in mind, the ability to give feedback “online” isn’t particularly relevant, and therefore catastrophes are not particularly central. We only need “feedback” in the sense that we can tell which decision we want, in any class of problems which any agent in the hypothesis space can recognize, in order to create a suitable dataset.
We need to be able to both tell what decision we want, and identify the relevant inputs on which to train. We could either somehow identify the relevant decisions in advance (e.g. as in adversarial training), or we could do it after the fact by training online if there are no catastrophes, but it seems like we need to get those inputs one way or the other. If there are catastrophes and we can’t do adversarial training, then even if we can tell which decision we want in any given case we can still die the first time we encounter an input where the system behaves catastrophically. (Or more realistically during the first wave where our systems are all exposed to such catastrophe-inducing inputs simultaneously.)
🤔 Some people talk about human ideologies as “egregores” which have independent agency. I had previously modelled them as just being a simple sort of emergent behavior, but this post makes me think that maybe they could be seen as malign inner agents embedded in world models, since they seem to cover the domain where you describe inner agents as being relevant (modelling over other agents).