Could you explain this a bit more deeply? I get the feeling I’m missing something as I try to pop myself out of Human Mode and put myself in Math Agent Mode.
To my mind, evidence to a value learner is still evidence; value learning is an epistemic procedure. Just as we don’t optimize the experimental data to confirm a hypothesis in science, we don’t optimize the value-learning data to support a particular utility function; that would be irrational. Considering the hypothetical scenarios is an exercise in logical uncertainty: the true data on which I’m supposed to condition is the unoptimized data, so any action I take to alter the value-learning input data is in fact destroying information about what I’m supposed to value; it’s increasing my own logical uncertainty.
The intuition here is almost frequentist: my mind tells me that there exists a True Utility Function which is Out There, and that my value-learning evidence constitutes data for learning what that function is; and if it’s my utility function, then my expected utility after learning is never lower (in expectation) than before learning, because learning reduces my uncertainty.
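For what it’s worth, the precise version of that last claim is just the non-negativity of the value of information: for a Bayesian agent whose actions cannot influence the evidence $e$ it will condition on, choosing after learning is, in expectation, at least as good as choosing before:

$$\mathbb{E}_{e}\!\left[\,\max_{a} \mathbb{E}[U \mid a, e]\,\right] \;\ge\; \max_{a} \mathbb{E}[U \mid a],$$

with equality only when no possible evidence would change the chosen action. The “cannot influence $e$” clause is exactly what fails for the young AI discussed below, since it can affect its own value-learning input.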
EDIT: Ok, I think this post makes more sense if I assume I’m thinking about an agent that doesn’t employ any form of logical uncertainty, and therefore has no epistemic access to the distinction between truth and self-delusion. Since the AI’s actions can’t cause its epistemic beliefs to fail to converge, but can cause its moral beliefs to fail to converge (because morality is not ontologically fundamental), there is a problem of making sure it doesn’t optimize its moral data away based on an incompletely learned utility function.
The problem is when you want to work with a young AI where the condition on which the utility function depends lies in the young AI’s decision-theoretic future. I.e., the AI is supposed to update on the value of an input field controlled by the programmers, but this input field (or even the abstractions behind it, like “the programmers’ current intentions”, should the AI already be mature enough to understand that) is something the AI can affect. If the AI is not already very sophisticated (more sophisticated than anyone presently has any good idea how to formally talk about), then in the process of building it, we’ll want to do “error correction” type things that the AI should accept, even though we can’t yet state formally how they’re information about an event outside of the programmers and the AI which neither can affect.
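To make the perverse incentive concrete, here is a minimal toy model (my own numbers and names, in the spirit of the “cake or death” style examples, not anything stated in this thread) of an agent that scores outcomes by whichever utility hypothesis its input field ends up endorsing. Under that naive criterion, forcing the field beats letting the programmers fill it in honestly:

```python
# Toy model (hypothetical numbers) of why a naive value learner wants to
# affect the input field it is supposed to learn from. The agent scores an
# outcome by whichever utility hypothesis the input field ends up endorsing,
# and the achievable payoff differs between hypotheses.

BEST_PAYOFF = {"hypothesis_A": 10.0,  # if A is endorsed, the agent can score 10
               "hypothesis_B": 1.0}   # if B is endorsed, the agent can only score 1

# The programmers' honest answer, from the agent's current state of knowledge.
P_HONEST = {"hypothesis_A": 0.5, "hypothesis_B": 0.5}

def expected_score_honest():
    """Let the programmers fill in the field, then optimize whatever they say."""
    return sum(p * BEST_PAYOFF[h] for h, p in P_HONEST.items())

def expected_score_manipulate():
    """Force the field to endorse the hypothesis with the higher ceiling."""
    best = max(BEST_PAYOFF, key=BEST_PAYOFF.get)
    return BEST_PAYOFF[best]

print("honest update:   ", expected_score_honest())      # 5.5
print("manipulate field:", expected_score_manipulate())  # 10.0
# The naive criterion prefers manipulation, even though manipulation destroys
# exactly the information the field was supposed to carry.
```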
Roughly, the answer is: “That True Utility Function thing only works if the AI doesn’t think anything it can do affects the thing you defined as the True Utility Function. Defining something like that safely would represent a very advanced stage of maturity in the AI. For a young AI it’s much easier to talk about the value of an input field. Then we don’t want the AI trying to affect this input field. Armstrong’s trick is trying to make the AI with an easily describable input field have some of the same desirable properties as a much-harder-to-describe-at-our-present-stage-of-knowledge AI that has the true, safe, non-perversely-instantiable definition of how to learn about the True Utility Function.”
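As I understand it, the trick being referred to is in the family of “utility indifference” constructions. Here is a sketch in the same toy model as above (my own construction, not Armstrong’s actual formalism): patch the agent’s utility with a compensation term sized so that every possible state of the input field is worth the same to it, which removes the incentive to push the field one way or the other.

```python
# Sketch of an indifference-style fix in the same toy model (my construction):
# add a compensation term so that every way the input field can come out is
# worth the same to the agent at the moment of the update.

BEST_PAYOFF = {"hypothesis_A": 10.0, "hypothesis_B": 1.0}

# Compensation paid when the field endorses B, sized to cancel the gap
# between the two branches.
COMPENSATION = BEST_PAYOFF["hypothesis_A"] - BEST_PAYOFF["hypothesis_B"]  # 9.0

def corrected_score(endorsed):
    """Payoff under the endorsed hypothesis, plus the compensation term."""
    bonus = COMPENSATION if endorsed == "hypothesis_B" else 0.0
    return BEST_PAYOFF[endorsed] + bonus

print({h: corrected_score(h) for h in BEST_PAYOFF})  # both come out to 10.0
# Forcing the field no longer beats letting the programmers set it honestly,
# even though the agent still acts on whichever hypothesis gets endorsed.
```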
Right, ok, that’s actually substantially clearer after a night’s sleep.
One more question, semi-relevant: how is the decision-theoretic future different from the actual future?
The actual future is your causal future, your future light cone. Your decision-theoretic future is anything that logically depends on the output of your decision function.
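A quick illustration of the difference (my example, not from the thread): an exact copy of your decision procedure running with no causal contact with you is not in your actual future, but its output logically depends on the output of your decision function, so it is in your decision-theoretic future.

```python
# Minimal illustration (my example): something can lie outside your causal
# future yet inside your decision-theoretic future, because it runs the same
# decision function you do.

def my_decision_function(situation):
    # Whatever this returns, every faithful copy of it returns too.
    return "cooperate" if situation == "twin_prisoners_dilemma" else "defect"

# A twin with no causal contact with me instantiates the same function. Its
# output logically depends on the output of my decision function, so it lies
# in my decision-theoretic future even though it is not in my light cone.
my_choice = my_decision_function("twin_prisoners_dilemma")
twin_choice = my_decision_function("twin_prisoners_dilemma")
assert my_choice == twin_choice
print(my_choice, twin_choice)  # cooperate cooperate
```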
This seems like a very useful idea—thanks!