As a strong default, STEM-level AGIs will have “goals”—or will at least look from the outside like they do. By this I mean that they’ll select outputs that competently steer the world toward particular states.
Clarification: when talking about world-states, I mean the world-state minus the state of the agent (we are interested in the external actions of the agent).
For starters, you can have goal-directed behavior without steering the world toward particular states. Novelty seeking, for example, doesn’t imply any particular world-state to achieve.
And I think the stronger default is that the agent will have goal uncertainty. What can a reinforcement learning agent say about its desired world-states or world-histories (the goal might not be expressible as a utility function over world-states) upon introspection? Nothing. Would it conclude that its goal is to make sure to self-stimulate for as long as possible? Given its vast knowledge of humans, the idea looks fairly dumb (it has low prior probability), and its realization contradicts almost any other possibility.
The only kind of agent that will know its goal with certainty is an agent that was programmed with its preferences explicitly pointing to the external world. That is, upon introspection the agent finds that its action selection circuitry contains a module that compares the expected world-states (or world-state/action pairs) produced by a given set of actions. That is, someone was dumb enough to try to program an explicit utility function but secured sufficient funding anyway (a completely possible situation, I agree).
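To make the "module that compares expected world-states" concrete, here is a minimal sketch of what such an explicit action selector might look like. Everything in it is an assumption for illustration: the names `select_action`, `predict`, and `utility`, the toy forecasts, and the choice of expected utility as the comparison rule are all hypothetical and do not come from any real system.

```python
from typing import Callable, Dict, Hashable, Iterable

# A world-state here means the external world only; the agent's own
# internal state is deliberately left out (illustrative assumption).
WorldState = Hashable


def select_action(
    actions: Iterable[str],
    predict: Callable[[str], Dict[WorldState, float]],
    utility: Callable[[WorldState], float],
) -> str:
    """Return the action whose predicted outcomes score highest on average."""

    def expected_utility(action: str) -> float:
        # predict(action) maps each possible world-state to its probability.
        return sum(p * utility(state) for state, p in predict(action).items())

    return max(actions, key=expected_utility)


# Toy, hypothetical forecasts: action -> {world-state: probability}.
forecasts = {
    "wait": {"unchanged": 1.0},
    "build": {"bridge_built": 0.8, "unchanged": 0.2},
}
# Toy, hypothetical explicit utility function over world-states.
toy_utility = {"unchanged": 0.0, "bridge_built": 1.0}

chosen = select_action(forecasts, forecasts.__getitem__, toy_utility.__getitem__)
print(chosen)  # build
```

An agent that inspects itself and finds something shaped like `utility` can read its stated goal directly off that function. Whether the stated goal matches what the creator actually wanted is a separate question.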
But does it really remove goal uncertainty? A sufficiently intelligent agent knows that its utility function is an approximation of the true preferences of the creator. That is, the prior probability of “stated goal == true goal” is infinitesimal (alignment is hard, and the agent knows it). Will that be enough to prevent the usual “kill-them-all and make tiny molecular squiggles”? The agent still has a choice of which actions to feed to the action selection block.
For starters, you can have goal-directed behavior without steering the world toward particular states. Novelty seeking, for example, doesn’t imply any particular world-state to achieve.
If you look from the outside like you’re competently trying to steer the world into states that will result in you getting more novel experience, then this is “goal-directed” in the sense I mean, regardless of why you’re doing that.
If you (e.g.) look from the outside like you’re selecting the local action that’s least like the actions you’ve selected before, regardless of how that affects you or your future novel experience, etc., then that’s not “goal-directed” in the sense I mean.
The distinction isn’t meant to be totally crisp (there are different degrees and dimensions of “goal-directedness”), but maybe these examples help clarify what I have in mind. “Maximize novel experience” is a pretty vague goal, but it’s not so vague that I think it falls outside of what I had in mind—e.g., I think the standard instrumental convergence concerns apply to “maximize novel experience”.
“Steer the world toward there being an even number of planets in the Milky Way Galaxy” also encompasses a variety of possible world-states (more than half of the possible worlds where the Milky Way Galaxy exists are optimal), but I think the arguments in the OP apply just as well to this goal.
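The contrast between the two novelty-seeking behaviors above can be sketched in a toy setting. This is only an illustration under made-up assumptions: a one-dimensional world with ten positions, "novel experience" counted as distinct positions visited, and two hypothetical policies named `least_used_action` and `novelty_planner`.

```python
from collections import Counter
from itertools import product

ACTIONS = ("left", "right")
SIZE = 10  # positions 0..9 in a toy one-dimensional world


def step(pos, action):
    """Move one cell left or right, staying inside the world."""
    delta = -1 if action == "left" else 1
    return min(max(pos + delta, 0), SIZE - 1)


def least_used_action(pos, seen, history):
    """Pick the action chosen least often so far, ignoring its consequences."""
    counts = Counter(history)
    return min(ACTIONS, key=lambda a: counts[a])


def novelty_planner(pos, seen, history, horizon=3):
    """Pick the first step of the short plan that reaches the most unseen positions."""

    def new_states(plan):
        p, found = pos, set()
        for a in plan:
            p = step(p, a)
            if p not in seen:
                found.add(p)
        return len(found)

    best_plan = max(product(ACTIONS, repeat=horizon), key=new_states)
    return best_plan[0]


def run(policy, steps=8):
    """Run a policy from position 0 and return how many distinct positions it saw."""
    pos, seen, history = 0, {0}, []
    for _ in range(steps):
        action = policy(pos, seen, history)
        history.append(action)
        pos = step(pos, action)
        seen.add(pos)
    return len(seen)


print(run(least_used_action))  # 2
print(run(novelty_planner))    # 9
```

The first policy varies its actions but just oscillates near the start. The second looks, from the outside, like it is steering toward states that yield more novel experience, which is the sense of "goal-directed" at issue here.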
A sufficiently intelligent agent knows that its utility function is an approximation of the true preferences of the creator.
Nope! Humans were created by evolution, but our true utility function isn’t “maximize inclusive reproductive fitness” (nor is it some slightly tweaked version of that goal).
See also, in the OP: “Problem of Fully Updated Deference: Normative uncertainty doesn’t address the core obstacles to corrigibility.”
We know that evolution has no preferences (evolution is not an agent), so we generally don’t frame our preferences as an approximation of evolution’s. People who believe that they were created with some goal in the creator’s mind do engage in reasoning about what they were truly meant to do.
The provided link assumes that any preference can be expressed as a utility function over world-states. If you don’t assume that (and you shouldn’t, as human preferences can’t be expressed as such), you cannot maximize a weighted average of potential utility functions. Some actions are preference-wise irreversible. Take virtue ethics, for example: wiping out your memory doesn’t restore your status as a virtuous person even if the world no longer contains any information about your unvirtuous acts, so you don’t plan to do that.
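A small sketch may help show the structure of this objection. It is not taken from the linked article: the candidate utility functions, the credences, and the `virtue_preference` function are all hypothetical. It only illustrates that a credence-weighted mixture of utility functions over world-states, however it is weighted, must give the same score to two histories that end in the same state, while a history-dependent preference need not.

```python
from typing import Callable, Sequence, Tuple

State = str
History = Tuple[State, ...]


def mixture(
    candidates: Sequence[Tuple[float, Callable[[State], float]]],
) -> Callable[[State], float]:
    """Build a credence-weighted average of candidate utility functions over states."""
    total = sum(weight for weight, _ in candidates)

    def score(state: State) -> float:
        return sum(weight * fn(state) for weight, fn in candidates) / total

    return score


# Two hypothetical candidate utility functions over world-states.
def likes_gardens(state: State) -> float:
    return 1.0 if state == "garden" else 0.0


def likes_libraries(state: State) -> float:
    return 1.0 if state == "library" else 0.0


score = mixture([(0.75, likes_gardens), (0.25, likes_libraries)])
best_state = max(["garden", "library"], key=score)
print(best_state)  # garden


# A hypothetical history-dependent preference in the spirit of virtue ethics:
# a past betrayal counts against you even if every trace of it is later erased.
def virtue_preference(history: History) -> float:
    return 0.0 if "betrayal" in history else 1.0


clean = ("clean_record",)
wiped = ("betrayal", "memory_wipe", "clean_record")

same_final_state = clean[-1] == wiped[-1]
state_scores_equal = score(clean[-1]) == score(wiped[-1])
history_scores_differ = virtue_preference(clean) != virtue_preference(wiped)
print(same_final_state, state_scores_equal, history_scores_differ)  # True True True
```

Under these assumptions, no choice of weights in `mixture` can separate `clean` from `wiped`, because both end in the same world-state.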
When I asked here earlier why the article “Problem of Fully Updated Deference” uses an incorrect assumption, the answer I got was that it’s better to have some approximation than none, as it allows progress in exploring the problem of alignment. But I see that it has become an unconditional cornerstone rather than a toy example for analysis.