To answer your thought experiment: it doesn’t matter what the agent thinks it’s acting on. We look at it from the outside instead (but using a particular definition/dependence that specifies the agent), and ask how its action depends on the dependence of the actual future on its actual action. The agent’s misconceptions don’t enter into this question. If the misconceptions are great, it’ll turn out that the dependence of the actual future on the agent’s action doesn’t control its action, or controls it in some unexpected way. Alternatively, we could say that it’s not the actual future that is morally relevant for the agent, but some other strange fact, in which case the agent could be said to be optimizing a world that is not ours. From yet another perspective, the role of the action could be played by something else, but then it’s not clear why we would be considering such a model and talking about this particular actual agent at the same time.
Is that something you can see from the outside? If I argmax over actions in expected-paper-clips or in updateless-prior-expected-paper-clips, how can you translate my black-box behavior over possible worlds into the dependence of my behavior on the dependence of the worlds on my behavior?
See the section “Utility functions” of this post: it shows how a dependence between two fixed facts could be restored in an ideal case where we can learn everything there is to learn about it. Similarly, you could consider the fact of which dependence holds between two facts, with various specific functions as its possible values, and ask what you can infer about that other fact if you assume that the dependence is given by a certain function.
More generally, a dependence follows possible inferences: things that could be inferred about one fact if you learn new things about the other fact. It needs to follow all such inferences, to the best of the agent’s ability; otherwise it won’t be right and you’ll get incorrect decisions (counterfactual models).
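As a rough illustration of treating "which dependence holds" as a variable, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not something from the post: the names `argmax_agent`, `misguided_agent` and `outside_view`, the two-action setup, and the simplification that a dependence is just a function from an action to a number of paper clips. It also assumes we have the agent's definition available as a function of the dependence, rather than only its observed behavior.

```python
from typing import Callable, Dict, Sequence

Action = str
# Hypothetical simplification: a "dependence" maps an action to the
# number of paper clips in the resulting future.
Dependence = Callable[[Action], float]
Agent = Callable[[Dependence, Sequence[Action]], Action]


def argmax_agent(dependence: Dependence, actions: Sequence[Action]) -> Action:
    """Agent whose action is controlled by the dependence it is given."""
    return max(actions, key=dependence)


def misguided_agent(dependence: Dependence, actions: Sequence[Action]) -> Action:
    """Agent with a fixed misconception: it ignores the dependence entirely."""
    return actions[0]


def outside_view(
    agent: Agent,
    candidates: Dict[str, Dependence],
    actions: Sequence[Action],
) -> Dict[str, Action]:
    """Assume each candidate dependence in turn and record the agent's action."""
    return {name: agent(dep, actions) for name, dep in candidates.items()}


ACTIONS = ("build_factory", "do_nothing")
CANDIDATES: Dict[str, Dependence] = {
    "factory_helps": lambda a: 10.0 if a == "build_factory" else 0.0,
    "factory_hurts": lambda a: -5.0 if a == "build_factory" else 0.0,
}

controlled = outside_view(argmax_agent, CANDIDATES, ACTIONS)
uncontrolled = outside_view(misguided_agent, CANDIDATES, ACTIONS)
```

In this toy setup the action of `argmax_agent` changes when the assumed dependence changes, while the action of `misguided_agent` stays the same whichever dependence holds, which is one way to read "the dependence doesn't control its action".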
(Reading this comment first might be helpful.)