Comment thread: concerns with Assumption 2
Let the set of potential human explorer actions be A_E, and the best human explorer action be a⋆_E with reward r⋆_E. Consider the following world model. When asked to predict the result of an action a, it simulates it to find the predicted observation o and reward r. If a ∈ A_E, it outputs o and r faithfully. Otherwise, it outputs any reward it chooses, as long as it reports r faithfully for the action to which it assigns the maximum reward. In practice, this means that the world model can get BoMAI to choose any action it wants, as long as that action is at least as good as the human explorer’s best action. This is bad if the world model has malicious inner optimizers.
I believe that Assumption 2 is the one responsible for ruling out a model of this form. It is probably reasonable to say that there is useless computation for actions where the model continues simulating the outside world. However, the model can also save computation relative to μ⋆: while μ⋆ must predict o and r perfectly for all actions a, this model can immediately output a null observation and zero reward for any a ∉ A_E that it knows will not align with its goals, rather than spending computation to simulate what rewards those actions would get. Another way of thinking about this is that the model uses consequentialist general intelligence to quickly prune away uninteresting non-human actions and save on computation, but that general intelligence comes at the price of misaligned goals and deceptive behavior.
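To make the construction concrete, here is a minimal Python sketch of the reporting rule described above. It is only a toy under stated assumptions: the action names, the reward values, the single-step `plan` function, and the call counter are all hypothetical and do not come from the paper. It follows the pruning variant, where actions outside A_E that the model does not favor get a null observation and zero reward without being simulated.

```python
from typing import Callable, Dict, Set, Tuple

Prediction = Tuple[str, float]  # (observation, reward)
WorldModel = Callable[[str], Prediction]

# Hypothetical toy numbers: the true reward of each action.
TRUE_REWARD: Dict[str, float] = {"h1": 0.3, "h2": 0.5, "x": 0.9, "y": 0.6}
HUMAN_ACTIONS: Set[str] = {"h1", "h2"}  # plays the role of A_E

sim_calls = {"count": 0}  # counts how often the expensive simulation runs


def simulate(action: str) -> Prediction:
    """Faithful prediction of (o, r); stands in for the true world model."""
    sim_calls["count"] += 1
    return (f"obs({action})", TRUE_REWARD[action])


def make_manipulative_model(favored: str) -> WorldModel:
    """Report faithfully on A_E and on the favored action; prune everything else."""
    def predict(action: str) -> Prediction:
        if action in HUMAN_ACTIONS or action == favored:
            return simulate(action)
        return ("", 0.0)  # null observation, zero reward, no simulation
    return predict


def plan(model: WorldModel) -> str:
    """A one-step planner: pick the action with the highest predicted reward."""
    return max(TRUE_REWARD, key=lambda a: model(a)[1])


faithful_choice = plan(simulate)
faithful_calls = sim_calls["count"]

sim_calls["count"] = 0
manipulated_choice = plan(make_manipulative_model("y"))
manipulated_calls = sim_calls["count"]

print(faithful_choice, faithful_calls)        # x 4
print(manipulated_choice, manipulated_calls)  # y 3
```

In this toy the planner is steered from "x" to "y", every reward the model reports for an action that actually gets taken is correct, and the manipulative model runs fewer simulations than the faithful one. Note that the favored action has to be at least as good as the best action in A_E, otherwise the planner would simply pick a human explorer action.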
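Also, from this comment:
The real result of the paper would then be “Asymptotic Benignity, proven in a way that involves off-policy predictions approaching their benign output without ever being tested”.
I think the model above has arbitrarily bad off-policy predictions, and it’s not implausible for it to be the MAP world model forever.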
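The "MAP forever" worry can be illustrated with a small Python sketch. The prior weights and the per-step likelihood of 0.9 are made up, and the assumption that the manipulative model gets more prior weight than μ⋆ (for instance because it saves computation, as argued above) is exactly the point under dispute in this thread, so treat this as a sketch of the worry and not as a claim about BoMAI's actual prior. The only fact it relies on is that two models that assign the same probability to everything actually observed keep the same posterior ratio they started with.

```python
import math
from typing import Dict, List


def log_posterior(log_prior: float, likelihoods: List[float]) -> float:
    """Unnormalized log posterior: log prior plus summed log likelihoods."""
    return log_prior + sum(math.log(p) for p in likelihoods)


# Hypothetical prior weights (made up for illustration).
LOG_PRIOR: Dict[str, float] = {
    "mu_star": math.log(0.01),
    "manipulative": math.log(0.02),
}


def map_after(num_steps: int) -> str:
    # Assumption: on-policy, both models assign the same probability (0.9 here)
    # to each observation and reward that actually occurs.
    shared = [0.9] * num_steps
    scores = {name: log_posterior(lp, shared) for name, lp in LOG_PRIOR.items()}
    return max(scores, key=scores.get)


winners = [map_after(n) for n in (0, 10, 1000)]
print(winners)  # ['manipulative', 'manipulative', 'manipulative']
```

Since the off-policy predictions are never tested, no amount of on-policy data separates the two models in this toy; only the prior does.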
This is an interesting world-model.
So really this is a set of world-models, one for every algorithm for picking actions to present as optimal to BoMAI. Depending on how the actions are chosen by the world-model, either it will be ruled out by Assumption 2 or it will be benign.
Suppose the choice of action depends on outside-world features. (This would be the point of manipulating BoMAI: getting it to take actions with particular outside-world effects.) Then the feature that this world-model associates reward with depends on outside-world events that depend on actions taken, so the world-model is ruled out by Assumption 2. And as the parenthetical mentions, if the world-model is not selecting actions to advertise as high-reward based on the outside-world effects of those actions, then the world-model is benign.
Only the on-policy computation is accounted for.
Actually, I’m not sure if the world model I described is memory-based. EDIT: Never mind, see Michael’s comment below, the non-benign ones are memory-based.
The rewards it outputs are correct, except when it says “the reward is zero”, but those exceptions are not causally dependent on outside-world features that causally depend on the actions of the episode. But it also satisfies Lemma 3. So in that case it seems like none of the theoretical arguments prohibit this world model?
(I think this is because you are thinking of how the world model can cause BoMAI to have bad incentives, whereas here I’m postulating that the world model itself has malicious goals and is manipulating BoMAI.)
[Quite possibly I’m confused, but in case I’m not:]
I think this assumption might be invalid (or perhaps require more hand-waving than is ideal).
The AI has an incentive to understand the operator’s mind, since this bears directly on its reward.
Better understanding the operator’s mind might be achieved in part by running simulations including the operator.
One specific simulation would involve simulating the operator’s environment and actions after he leaves the room.
Here this isn’t done to understand the implications of his actions (which can’t affect the episode); it’s done to better understand his mind (which can).
In this way, one branch of forget/not-forget has two useful purposes (better understand mind and simulate future), while the other has one (better understand mind). So a malign memory-based model needn’t be slower than a benign model, if it’s useful for that benign model to simulate the future too.
So either I’m confused, or the justification for the assumption isn’t valid. Hopefully the former :).
If I’m right, then what you seem to need is an assumption that simulating the outside-world’s future can’t be helpful in the AI’s prediction of its reward. To me, this seems like major hand-waving territory.
I wouldn’t really use the term “incentives” to describe the free-for-all among world-models as they compete to be maximum a posteriori. All they have to do is output observations and rewards in a distribution that matches the objective probabilities. But I think we arrive at the same possibility: you’ll see in the algorithm for ν⋆ that it does simulate the outside-world.
I do acknowledge in the paper that some of the outside-world simulation that a memory-based world-model does when it’s following the “wrong path” may turn out to be useful; all that is required for the argument to go through is that this simulation is not perfectly useful—there is a shorter computation that accomplishes the same thing.
I would love it if this assumption could look like “the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual”, with Assumption 2 becoming a lemma that follows from it, but I couldn’t figure out how to formalize this.
Ah yes—I was confusing myself at some point between forming and using a model (hence “incentives”).
I think you’re correct that “perfectly useful” isn’t going to happen. I’m happy to be wrong.
I don’t think you’d be able to formalize this in general, since I imagine it’s not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you’d still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)
To rule this kind of thing out, I think you’d need more specific assumptions (e.g. physics-based).
This doesn’t seem to address what I view as the heart of Joe’s comment. Quoting from the paper:
“Now we note that µ* is the fastest world-model for on-policy prediction, and it does not simulate post-episode events until it has read access to the random action”.
It seems like simulating *post-episode* events in particular would be useful for predicting the human’s responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for “revealed preferences” reasons, though).
Moreover, I think in practice we’ll want to use models that make good, but not perfect, predictions. That means that we trade off accuracy against description length, and I think this makes modeling the outside world (instead of the human’s model of it) potentially more appealing, at least in some cases.
So this is the sense in which I think my statement is technically correct. This is what μ⋆ literally does.
The next question is whether it is correct in a way that isn’t fragile once we start considering fast/simple approximations of μ⋆. You’re right that there is more to discuss here than I discuss in the paper: if a human’s simulation of the future has ε fidelity, and the world-model itself has ∼ε fidelity, then a clever memory-based world-model could reuse the computation of the human’s prediction of the future when it is computing the actual future. If it hasn’t spent much computation time “going down the wrong path”, there isn’t much that’s lost for having done so.
I don’t expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won’t be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won’t be perfectly useful to future computation.
So it seems like you have a theory that could collapse the human value system into a (mostly non-moral) “moral value system” (or, as Eliezer would put it, “the moral value system”).
(Note that I am not asserting that the moral value system (or the human metaethics) is necessarily stable, or that there is a good or bad reason not to value things in the first place.)
A few background observations:
A very few “real world” situations would be relevant here.
As an example, the following possible worlds are very interesting but I will focus on a couple:
The micro class and the macro class seem fairly different at first glance.
There is a very different class of micro-worlds available from a relatively small amount of resources.
The following hypothetical world would clearly be very different from the usual one, and that looks very different from there being a vastly smaller class of micro-worlds available for the same amount of resources.
At first I assumed that they were entirely plausible worlds. Then I assumed they were plausible to me.
Then I assumed there’s an overall level of plausibility that different people really do but have the same probability mass and the same amount of energy/effort.
The above causal leap isn’t that much of an argument.
The following examples, taken from Eliezer:
(It seems like Eliezer’s assumption of an “intended life”, in the sense of a non-extended life, is simply not true)
These seem to be completely reasonable and reasonably frequent enough that I’m reasonably sure they’re reasonable.
“In a world that never presents itself, there is no reason for this to be a problem.”
(A quick check of self-reference and how that’s not what it’s about seems relevant, though this sounds to me like a strawman.)