I guess the obvious response is that we can instead simulate the internal workings of the human in detail, and thus uncover their simulation of post-episode events (as a past event).
So this is the sense in which I think my statement is technically correct. This is what μ⋆ literally does.
The next question is whether it is correct in a way that isn’t fragile once we start considering fast/simple approximations of μ⋆. You’re right that there is more to discuss here than I discuss in the paper: if a human’s simulation of the future has ε fidelity, and the world-model itself has ∼ε fidelity, then a clever memory-based world-model could reuse the computation of the human’s prediction of the future when it is computing the actual future. If it hasn’t spent much computation time “going down the wrong path”, there isn’t much that’s lost for having done so.
I don’t expect the human operator will be simulating/imagining all post-episode events that are relevant for ε-accurate predictions of future episodes. ε-accurate world-models have to simulate all the outside-world events that are necessary to get within an ε threshold of understanding how episodes affect each other, and it won’t be necessary for the human operator to consider all this. So I think that even for approximately accurate world-models, following the wrong counterfactual won’t be perfectly useful to future computation.
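To make this concrete, here is a minimal sketch (my own illustration, with hypothetical names; it is not anything from the paper) of the reuse idea: a memory-based world-model returns the operator’s cached prediction of a post-episode event when that prediction is already within the model’s own tolerance ε, and otherwise falls back to a full simulation. The fallback branch is expected to be the common case, since the operator only imagines a small subset of the events an ε-accurate model has to cover.

```python
# A minimal sketch, assuming hypothetical names throughout; this is not the
# paper's construction. It illustrates when a memory-based world-model could
# reuse the operator's imagined post-episode outcomes instead of re-simulating.
from typing import Callable, Dict, Tuple

class MemoryBasedWorldModel:
    def __init__(
        self,
        human_predictions: Dict[str, Tuple[float, float]],  # event -> (predicted value, fidelity)
        simulate: Callable[[str], float],                    # expensive ground-truth simulation
        eps: float,                                          # the model's own error tolerance
    ):
        self.human_predictions = human_predictions
        self.simulate = simulate
        self.eps = eps

    def predict(self, event_id: str) -> float:
        cached = self.human_predictions.get(event_id)
        if cached is not None:
            value, fidelity = cached
            if fidelity <= self.eps:
                # Reuse the operator's computation: little is lost by having
                # "gone down the wrong path", since the answer is already good enough.
                return value
        # Common case: the operator never imagined this outside-world event,
        # so the model has to pay for the simulation itself.
        return self.simulate(event_id)

# Hypothetical usage: the operator imagined only one post-episode event.
model = MemoryBasedWorldModel(
    human_predictions={"lab_door_opens": (0.9, 0.01)},
    simulate=lambda event: 0.5,  # placeholder simulator
    eps=0.05,
)
print(model.predict("lab_door_opens"))   # reused from the operator's imagination
print(model.predict("market_reaction"))  # never imagined, so simulated from scratch
```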
So it seems like you have a theory that could collapse the human value system into a (mostly non-moral) “moral value system” (or, as Eliezer would put it, “the moral value system”).
(Note that I am not asserting that the moral value system (or human metaethics) is necessarily stable, or that there are good or bad reasons not to value things in the first place.)
A few background observations:
Only a few “real world” situations would be relevant here.
As an example, many possible worlds are interesting here, but I will focus on a couple:
The micro class and the macro class seem fairly different at first glance.
There is a very different class of micro-worlds available for a relatively small amount of resources.
The following hypothetical world would clearly be very different from the usual one: a vastly smaller class of micro-worlds is available for the same amount of resources.
At first I assumed that they were entirely plausible worlds. Then I assumed they were plausible to me.
Then I assumed there is an overall level of plausibility that different people really do assign, with the same probability mass and the same amount of energy/effort.
The above causal leap isn’t that much of an argument.
The following examples are taken from Eliezer:
(It seems that Eliezer’s assumption of an “intended life”, in the sense of a non-extended life, is simply not true.)
These seem reasonable, and common enough, that I’m fairly confident they hold.
“In a world that never presents itself, there is no reason for this to be a problem.”
(A quick check of self-reference, and of how that’s not what this is about, seems relevant, though this sounds to me like a strawman.)