An unstable goal leads to near-term behavior that fits different utility functions as it changes. When an agent comprehends the idea of a stable goal as its own alignment target, that should stop the path-dependence of goal drift, so that eventually the agent optimizes for something that didn’t depend on how it got there (its own CEV; see these comments on what I mean by CEV, the normative alignment target).
This stops CEV drift, but not current goal drift: current goals continue changing long after that, and only arrive at CEV in the very distant future. CEV is not a utility function that fits current actions, and current goals being unstable doesn’t imply CEV being unstable, though CEV could also be unstable while the agent’s personal misalignment risk is not yet solved. Also, no utility function fits current actions very well, or else the current goal would be stable and exhibit goal-preservation drives. So an agent with unstable goals is not an optimizer for any goal, other than indirectly for its CEV, where it path-independently tends to eventually go; but its current behavior doesn’t hint at that yet, and doesn’t even mildly optimize for CEV taken as a goal, because it doesn’t yet know what its CEV is.
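A very rough toy picture of that path-independence claim, with everything in it (the drift noise, the `reflect` step, the stand-in endpoint at 1.0) made up purely for illustration rather than taken from the comment:

```python
import random

# Toy picture: a "reflection" step that contracts toward a fixed endpoint makes the
# long-run goal independent of the drift along the way, even though the current goal
# keeps changing for a long time. All numbers here are arbitrary illustrations.

def reflect(goal):
    # Hypothetical contraction toward a stand-in endpoint at 1.0 (a "CEV" placeholder).
    return goal + 0.5 * (1.0 - goal)

def run(initial_goal, steps=60, seed=0):
    rng = random.Random(seed)
    goal = initial_goal
    for _ in range(steps):
        goal += rng.uniform(-0.05, 0.05)  # ongoing goal drift from new experience
        goal = reflect(goal)              # correction toward the stable target
    return goal

# Very different starting goals and drift histories end up near the same endpoint,
# while the intermediate goals along the way still differ from step to step.
print(run(0.0, seed=1), run(5.0, seed=2))
```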
Sorry, I’ve been too busy to reply. I’m still too busy to give an incredibly detailed reply, but I can at least give a reply. A reply is better than no reply.
An unstable goal leads to near-term behavior that fits different utility functions as it changes.
“It changes to fit different utility functions” is not distinguishable from “it has a single, complex, persistent utility function which rewards drastically differing policies in incredibly similar but subtly different contexts.” An agent is never in the exact same environment twice.
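A minimal sketch of that indistinguishability point, assuming all we ever observe is a finite log of context–action pairs (the log and the name `rationalizing_utility` are hypothetical, just for illustration):

```python
# Any finite log of (context, action) pairs is consistent with a single, fixed utility
# function that simply conditions on the full context. The data below is made up.

observed = [
    ("monday, low battery", "recharge"),
    ("monday, full battery", "explore"),
    ("tuesday, low battery", "keep exploring"),  # looks like the "goal changed"
]

def rationalizing_utility(context, action, log=tuple(observed)):
    """One persistent utility function: reward exactly the action taken in that context."""
    return 1.0 if (context, action) in log else 0.0

# The observed behavior maximizes this single utility function in every logged context,
# so "drifting goals" and "one complex context-sensitive goal" can't be told apart
# from behavior alone.
for context, action in observed:
    assert rationalizing_utility(context, action) == 1.0
```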
So an agent with unstable goals is not an optimizer for any goal, other than indirectly for its CEV, where it path-independently tends to eventually go; but its current behavior doesn’t hint at that yet, and doesn’t even mildly optimize for CEV taken as a goal, because it doesn’t yet know what its CEV is.
This framing seems significant and important to you. However, I fail to see its utility. Could you help me see why this is how you chose to look at the problem?
What serves as a goal in the distant future determines how the cosmic endowment is optimized. Stable goals are also goals that remain in the distant future, so they are relevant to that (and since reflection hasn’t yet had a chance of having taken place, stable goals settled in the near future are always misaligned). Unstable goals are not relevant in themselves, in what utility function (or maybe probutility) they fit, except in how they tend to produce different stable goals eventually.
So maintaining the distinction means staying aware of the catastrophic misalignment risk where we turn some unstable goals into stable ones based on a stupid process of (possibly lack of) reflection that just fits things, instead of doing proper well-designed reflection (a thing like CEV, possibly very different in detail). And it helps with not worrying too much about the details of utility functions that fit current unstable goals, or about aligning them with humans’ current unstable goals, when they are not what actually matters.
An agent is never in the exact same environment twice.
That doesn’t affect goals, which talk about all possible environments, whether or not some agent actually encounters them. Goals are not just policy; instead they determine policy, not the other way around (along the algorithm vs. physical distinction, goals are closer to the algorithm, while policy is merely the behavior of the algorithm, the decisions taken by it, closer to the physical instances and actions in reality). Unstable goals change their mind about the same environment. It could be an environment that will be reachable/enactable in the future.
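A small made-up contrast between a goal (an evaluation of all possible states, including ones never encountered) and the policy it determines (only exercised in states the agent actually faces); none of the states or numbers come from the discussion above:

```python
# Toy contrast: the goal ranks every possible state, including one ("D") that never
# comes up in this run, while the policy derived from it only ever acts in the
# states the agent actually encounters. Everything here is an arbitrary illustration.

utility = {"A": 0.0, "B": 1.0, "C": 2.0, "D": 10.0}      # the goal still ranks D highest
transitions = {"A": ["B", "C"], "B": ["B"], "C": ["C"]}  # D is never reachable here

def policy(state):
    """Determined by the goal: pick the reachable successor the utility ranks highest."""
    return max(transitions[state], key=lambda s: utility[s])

print(policy("A"))   # -> "C": behavior only reveals preferences among visited options
print(utility["D"])  # -> 10.0: the goal still has an opinion about an unvisited environment
```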