I wouldn’t really use the term “incentives” to describe the free-for-all among world-models as they compete to be maximum a posteriori. All they have to do is output observations and rewards in a distribution that matches the objective probabilities. But I think we arrive at the same possibility: you’ll see in the algorithm for ν⋆ that it does simulate the outside world.
I do acknowledge in the paper that some of the outside-world simulation that a memory-based world-model does when it’s following the “wrong path” may turn out to be useful; all that is required for the argument to go through is that this simulation is not perfectly useful—there is a shorter computation that accomplishes the same thing.
I would love it if this assumption could look like: “the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual” and make assumption 2 into a lemma that follows from it, but I couldn’t figure out how to formalize this.
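The “free-for-all” among world-models can be sketched as a toy computation. This is only an illustration, not the paper’s actual algorithm: the names, the form of the penalty, and all numbers below are invented, but the shape matches the idea that a posterior weighs fit against description length and computation time, so a model that wastes steps simulating things it doesn’t need loses the MAP competition.

```python
import math

# Toy sketch of the MAP competition among world-models (all values invented).
# Each candidate is scored by its fit to the observed (observation, reward)
# sequence, minus a description-length penalty, minus a speed penalty.

def log_posterior(log_likelihood, description_length_bits, compute_steps, beta=0.5):
    # log P(model | data) up to a constant.
    return (log_likelihood
            - description_length_bits * math.log(2)
            - beta * math.log(compute_steps))

candidates = [
    # (name, log-likelihood, description length in bits, computation steps)
    ("mu_star",      -10.0, 100, 1_000),  # predicts on-policy directly
    ("memory_based", -10.0, 100, 5_000),  # also simulates outside-world events
]

# With equal fit and equal complexity, the faster model wins the competition.
map_model = max(candidates, key=lambda c: log_posterior(c[1], c[2], c[3]))
print(map_model[0])
```

Since both candidates match the objective probabilities equally well, only the speed penalty differentiates them, which is the sense in which no “incentive” is needed: the posterior does the selecting.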
Ah yes—I was confusing myself at some point between forming and using a model (hence “incentives”).
I think you’re correct that “perfectly useful” isn’t going to happen. I’m happy to be wrong.
“the quickest way to simulate one counterfactual does not include simulating a mutually exclusive counterfactual”
I don’t think you’d be able to formalize this in general, since I imagine it’s not true. E.g. one could imagine a fractal world where every detail of a counterfactual appeared later in a subbranch of a mutually exclusive counterfactual. In such a case, simulating one counterfactual could be perfectly useful to the other. (I suppose you’d still expect it to be an operation or so slower, due to extra indirection, but perhaps that could be optimised away??)
To rule this kind of thing out, I think you’d need more specific assumptions (e.g. physics-based).
This doesn’t seem to address what I view as the heart of Joe’s comment. Quoting from the paper:
“Now we note that µ* is the fastest world-model for on-policy prediction, and it does not simulate post-episode events until it has read access to the random action”.
It seems like simulating *post-episode* events in particular would be useful for predicting the human’s responses, because they will be simulating post-episode events when they choose their actions. Intuitively, it seems like we *need* to simulate post-episode events to have any hope of guessing how the human will act. I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event). That seems correct, but also a bit troubling (again, probably just for “revealed preferences” reasons, though).
Moreover, I think in practice we’ll want to use models that make good, but not perfect, predictions. That means we trade off accuracy against description length, and I think this makes modeling the outside world (instead of the human’s model of it) potentially more appealing, at least in some cases.
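The trade-off being described can be caricatured as an MDL-style model-selection problem. Everything here is a made-up illustration (the bit counts, error rates, and the trade-off weight are not from the paper): once imperfect prediction is allowed, a direct model of the outside world can win against a model of the human’s model of it, depending on how the numbers fall.

```python
# Toy MDL-style trade-off between description length and prediction error
# (all numbers invented). Lower score is better.

def score(description_length_bits, prediction_error, lam=200.0):
    # lam converts prediction error into the same "currency" as bits.
    return description_length_bits + lam * prediction_error

models = {
    # model of what?                 (bits, expected prediction error)
    "human's model of outside world": (500, 0.05),
    "outside world directly":         (400, 0.04),
}

best = min(models, key=lambda name: score(*models[name]))
print(best)
```

The point is only that the winner is contingent on the trade-off weight and the particulars, which is why modeling the outside world is “potentially more appealing, at least in some cases” rather than always.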
I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event).
So this is the sense in which I think my statement is technically correct. This is what μ⋆ literally does.
The next question is whether it is correct in a way that isn’t fragile once we start considering fast/simple approximations of μ⋆. You’re right that there is more to discuss here than I discuss in the paper: if a human’s simulation of the future has ε fidelity, and the world-model itself has ∼ε fidelity, then a clever memory-based world-model could reuse the computation of the human’s prediction of the future when it is computing the actual future. If it hasn’t spent much computation time “going down the wrong path,” there isn’t much that’s lost for having done so.
I don’t expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won’t be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won’t be perfectly useful to future computation.
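The “not perfectly useful” claim above can be caricatured with invented numbers (nothing below is from the paper): a memory-based world-model spends some steps simulating the human’s imagined counterfactual future before reading the random action, and only a fraction of that work turns out to be reusable toward the on-policy prediction. Unless that fraction is exactly one, the wrong path has a net cost.

```python
# Caricature of the computation-reuse argument (all numbers invented).

N = 10_000  # steps needed for an eps-accurate on-policy prediction
W = 2_000   # steps spent "going down the wrong path" (the human's imagined future)
f = 0.8     # fraction of the wrong-path work that is reusable afterwards

memory_based_steps = W + (N - f * W)  # wrong path first, then the remaining work
mu_star_steps = N                     # mu* just does the needed work directly

# The extra cost of the wrong path is W * (1 - f); it vanishes only if f == 1.
print(memory_based_steps - mu_star_steps)
```

The comment above argues that f = 1 (perfect reuse) shouldn’t be expected, because the operator won’t be imagining all the outside-world events an ε-accurate world-model has to simulate.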
So it seems like you have a theory that could collapse the human value system into a (mostly non-moral) “moral value system” (or, as Eliezer would put it, “the moral value system”).
(Note that I am not asserting that the moral value system (or the human metaethics) is necessarily stable, or that there are good or bad reasons not to value things in the first place.)
A few background observations:
Only a few “real world” situations seem relevant here.
As an example, the following possible worlds are all interesting, but I will focus on a couple:
The micro class and the macro class seem fairly different at first glance.
There is a very different class of micro-worlds available from a relatively small amount of resources.
The following hypothetical world would clearly be very different from the usual one: a world in which a vastly smaller class of micro-worlds is available for the same amount of resources.
At first I assumed that they were entirely plausible worlds. Then I assumed they were plausible to me.
Then I assumed there’s an overall level of plausibility that different people assess differently, even while assigning the same probability mass and the same amount of energy/effort.
The above causal leap isn’t that much of an argument.
The following examples are taken from Eliezer:
(It seems like Eliezer’s assumption of an “intended life”, in the sense of a non-extended life, is simply not true.)
These seem reasonable, and frequent enough, that I’m fairly confident in them.
“In a world that never presents itself, there is no reason for this to be a problem.”
(A quick check of self-reference, and of how that’s not what this is about, seems relevant, though this sounds to me like a strawman.)