The problem is when you want to work with a young AI where the condition on which the utility function depends lies in the young AI’s decision-theoretic future. That is, the AI is supposed to update on the value of an input field controlled by the programmers, but this input field (or even abstractions behind it like “the programmers’ current intentions”, should the AI already be mature enough to understand that) is something the AI can affect. If the AI is not already very sophisticated, more sophisticated than anyone presently has any good idea how to talk about formally, then in the process of building it we’ll want to do “error correction” type things that the AI should accept, even though we can’t yet state formally how they’re information about an event outside the programmers and AI which neither can affect.
Roughly, the answer is: “That True Utility Function thing only works if the AI doesn’t think anything it can do affects the thing you defined as the True Utility Function. Defining something like that safely would represent a very advanced stage of maturity in the AI. For a young AI it’s much easier to talk about the value of an input field. Then we don’t want the AI trying to affect this input field. Armstrong’s trick is trying to make the AI with an easily describable input field have some of the same desirable properties as a much-harder-to-describe-at-our-present-stage-of-knowledge AI that has the true, safe, non-perversely-instantiable definition of how to learn about the True Utility Function.”
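To make the shape of Armstrong’s trick concrete, here’s a toy numerical sketch under heavy simplifying assumptions of my own: one boolean input field (a button the programmers may press), two candidate utilities, and two discrete actions. All the names and numbers (u_normal, u_shutdown, jam_button, the payoffs) are illustrative, not Armstrong’s actual formalism; the point is only that the naive agent prefers to tamper with the field, while the indifference-compensated agent doesn’t.

```python
# Toy sketch of Armstrong-style utility indifference (illustrative only).

def u_normal(paperclips, suspended):
    # Utility while the input field is unset: count paperclips made.
    return float(paperclips)

def u_shutdown(paperclips, suspended):
    # Utility once the field is set: reward suspending cleanly.
    return 1.0 if suspended else 0.0

# Each action fixes (paperclips made, probability the button gets pressed).
# "jam_button" sacrifices a little production to stop the programmers from
# ever pressing the button -- the manipulation we want to discourage.
ACTIONS = {
    "work":       {"paperclips": 5, "p_pressed": 0.5},
    "jam_button": {"paperclips": 4, "p_pressed": 0.0},
}

def expected_utility(action, corrected):
    a = ACTIONS[action]
    p = a["p_pressed"]
    # If the button is pressed the AI suspends; otherwise it keeps working.
    eu_unpressed = u_normal(a["paperclips"], suspended=False)
    eu_pressed = u_shutdown(a["paperclips"], suspended=True)
    if corrected:
        # Indifference compensation: on a press, also pay out the utility the
        # AI would have expected under u_normal had the press not happened,
        # minus what it gets under u_shutdown, so a press is value-neutral.
        eu_pressed += eu_unpressed - eu_pressed
    return (1 - p) * eu_unpressed + p * eu_pressed

for corrected in (False, True):
    label = "indifference-corrected" if corrected else "naive"
    scores = {act: expected_utility(act, corrected) for act in ACTIONS}
    print(label, scores, "->", max(scores, key=scores.get))
```

With these numbers the naive expected utilities are 3.0 for work and 4.0 for jam_button, so the naive agent jams the button; after the compensation the pressed and unpressed branches are worth the same, so work (5.0) beats jam_button (4.0) and the incentive to touch the field goes away.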
Right, ok, that’s actually substantially clearer after a night’s sleep.
One more question, semi-relevant: how is the decision-theoretic future different from the actual future?
The actual future is your causal future, your future light cone. Your decision-theoretic future is anything that logically depends on the output of your decision function.
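One toy illustration of the distinction, using a Newcomb-like setup of my own devising (the names decision and play are hypothetical): the predictor filled the boxes yesterday, so its action is in the agent’s causal past, but because the filling was computed from the same decision function the agent runs today, it still logically depends on that function’s output and so lies in the agent’s decision-theoretic future.

```python
# Toy sketch: an event in the causal past that is nonetheless in the
# agent's decision-theoretic future, because it logically depends on the
# output of the agent's decision function.

def decision(policy):
    # The agent's decision function: returns "one-box" or "two-box".
    return policy

def play(policy):
    # Yesterday: the predictor simulated the same decision function.
    predicted = decision(policy)
    opaque_box = 1_000_000 if predicted == "one-box" else 0
    # Today: the agent runs the identical function and chooses.
    choice = decision(policy)
    return opaque_box + (1_000 if choice == "two-box" else 0)

print(play("one-box"))   # 1000000
print(play("two-box"))   # 1000
```

The box contents are outside anything the agent can causally affect today, yet they vary with the output of its decision function, which is the sense in which they sit “downstream” of the decision.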
This seems like a very useful idea—thanks!