Humans are at least a little coherent, or we would never get anything done; but we aren’t very coherent, so the project of piecing together ‘what does the human brain as a whole “want”’ can be vastly more difficult than the problem of figuring out what a coherent optimizer wants.
This is a point where I feel like I do have a substantial disagreement with the “conventional wisdom” of LessWrong.
First, LessWrong began with a discussion of cognitive biases in human irrationality, so this naturally became a staple of the local narrative. On the other hand, I think that a lot of presumed irrationality is actually rational but deceptive behavior (where the deception runs so deep that it’s part of even our inner monologue). There are exceptions, like hyperbolic discounting, but not that many.
Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent. Therefore, if X is not entirely coherent then X’s preferences are only approximately defined, and hence we only need to infer them approximately. So, the added difficulty of inferring X’s preferences, resulting from the partial incoherence of these preferences, is, to a large extent, cancelled out by the reduction in the required precision of the answer. The way I expect this to cash out is: when the agent has g<∞, the utility function is only approximately defined, and we can infer it within this approximation. As g approaches infinity, the utility function becomes crisply defined[1] and can be inferred crisply. See also additional nuance in my answer to the cat question below.
This is not to say we shouldn’t investigate models like dynamically inconsistent preferences or “humans as systems of agents”, but that I expect the number of additional complications of this sort that are actually important to be not that great.
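A toy numerical sketch of the “approximately defined utility” point above (hypothetical numbers; the ε-optimality threshold below merely stands in for the role of g and is not the actual formalism):

```python
import itertools
import numpy as np

# Toy setup: an agent choosing among three options, observed to behave
# somewhat but not perfectly coherently. The choice frequencies are made up.
observed_policy = np.array([0.7, 0.25, 0.05])

# Candidate utility functions: every assignment from a coarse grid.
grid = np.linspace(0.0, 1.0, 11)
candidates = [np.array(u) for u in itertools.product(grid, repeat=3)]

def consistent(u, policy, eps):
    """u is consistent with the behavior if the policy loses at most eps of
    attainable expected utility, i.e. the agent is eps-optimal under u."""
    return u.max() - policy @ u <= eps

# Demanding more coherence (smaller eps, loosely "larger g") leaves fewer
# candidate utility functions consistent with the same behavior.
for eps in [0.5, 0.2, 0.05, 0.01]:
    n = sum(consistent(u, observed_policy, eps) for u in candidates)
    print(f"eps={eps}: {n} / {len(candidates)} candidate utilities consistent")
```

The point is only directional: incoherent behavior underdetermines the utility function, but by the same token an approximate answer is all that is being asked for.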
There are shards of planning and optimization and goal-oriented-ness in a cat’s brain, but ‘figure out what utopia would look like for a cat’ is a far harder problem than ‘identify all of the goal-encoding parts of the cat’s brain and “read off” those goals’. E.g., does ‘identifying utopia’ in this context involve uplifting or extrapolating the cat? Why, or why not? And if so, how does that process work?
I’m actually not sure that cats (as opposed to humans) are sufficiently “general” intelligences for the process to make sense. This is because I think humans are doing something like Turing RL (where consciousness plays the role of the “external computer”), and value learning is going to rely on that. The issue is, you don’t only need to infer the agent’s preferences but you also need to optimize them better than the agent itself. This might pose a difficulty if, as I suggested above, imperfect agents have imperfectly defined preferences. While I can see several hypothetical solutions, the TRL model suggests a natural approach where the AI’s capability advantage is reduced to having a better external computer (and/or better interface with that computer). This might not apply to cats, which (I’m guessing) don’t have this kind of consciousness[2] because (I’m guessing) the evolution of consciousness was tied to language and social behavior.
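A very rough structural sketch of the Turing RL picture gestured at here, purely for orientation (the class and method names below are illustrative assumptions, not the actual TRL formalism):

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class EnvAction:
    name: str

@dataclass
class ComputeAction:
    program: Callable[[], object]  # a computation to run on the external computer

Action = Union[EnvAction, ComputeAction]

class TRLLoop:
    """The agent's action space is split between acting on the environment and
    submitting computations to an "external computer" (conscious deliberation
    for humans, literal compute for an AI)."""
    def __init__(self, env_step: Callable[[EnvAction], object], budget: int):
        self.env_step = env_step   # environment transition function
        self.budget = budget       # available external compute

    def step(self, action: Action) -> object:
        if isinstance(action, ComputeAction):
            if self.budget <= 0:
                return "compute exhausted"
            self.budget -= 1
            return action.program()   # the result comes back as an observation
        return self.env_step(action)  # ordinary environment interaction

# In this picture, "the AI's capability advantage is a better external computer"
# just means a larger budget or a faster, cleaner interface, while the
# preference-inference problem itself is unchanged.
toy = TRLLoop(env_step=lambda a: f"did {a.name}", budget=10)
print(toy.step(ComputeAction(lambda: 2 ** 16)), toy.step(EnvAction("purr")))
```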
Getting a natural concept into an agent’s goal is a lot harder than getting it into an agent’s beliefs. Indeed, in the context of goals I’m not sure ‘naturalness’ actually helps at all, except insofar as natural kinds tend to be simple and simple targets are easier to hit?
I’m not saying that the specific goals humans have are natural: they are a complex accident of evolution. I’m saying that the general correspondence between agents and goals is natural.
[1] Asymptotically crisply: some changes are too small to affect the optimal policy, but I’m guessing that they become negligible when considering longer and longer timescales.
[2] This is not to say cats don’t have quasimoral value: I think they do.
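As a toy illustration of footnote [1] (hypothetical numbers): two utility functions that differ only on a hard-to-reach state induce identical behavior over short horizons, but longer horizons separate them, so as the timescale grows, the residual ambiguity in the inferred utility function shrinks.

```python
# States form a chain 0..5; in each state the agent can "stay" (collect that
# state's reward) or "go" (move one step right, collect nothing).
def optimal_first_action(R, horizon):
    n = len(R)
    V = [0.0] * n                    # value with 0 steps remaining
    for _ in range(horizon - 1):     # finite-horizon value iteration
        V = [max(R[s] + V[s], V[min(s + 1, n - 1)]) for s in range(n)]
    q_stay = R[0] + V[0]
    q_go = V[1]
    return "stay" if q_stay >= q_go else "go"

R1 = [0.1, 0, 0, 0, 0, 0.0]  # utility A: the far state is worthless
R2 = [0.1, 0, 0, 0, 0, 1.0]  # utility B: identical except at the far state

for T in [2, 5, 10, 50]:
    print(T, optimal_first_action(R1, T), optimal_first_action(R2, T))
# Short horizons: same optimal behavior under both utilities; long horizons:
# the behavior diverges, so over long timescales only genuinely tiny utility
# differences remain behaviorally invisible.
```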
Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
I’m not sure this is true; or if it’s true, I’m not sure it’s relevant. But assuming it is true...
Therefore, if X is not entirely coherent then X’s preferences are only approximately defined, and hence we only need to infer them approximately.
… this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and complexity cashing out as ‘EU-maximizer-ish’ are:
Maybe I sort-of contain a lot of subagents, and ‘my values’ are the conjunction of my sub-agents’ values (where they don’t conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.
In both cases, the fact that my brain isn’t a single coherent EU maximizer seemingly makes things a lot harder and more finicky, rather than making things easier. These are cases where you could say that my initial brain is ‘only approximately an agent’, and yet this comes with no implication that there’s any more room for error or imprecision than if I were an EU maximizer.
I’m not saying that the specific goals humans have are natural: they are a complex accident of evolution. I’m saying that the general correspondence between agents and goals is natural.
Right, but this doesn’t on its own help get that specific relatively-natural concept into the AGI’s goals, except insofar as it suggests “the correspondence between agents and goals” is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.
Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
I’m not sure this is true; or if it’s true, I’m not sure it’s relevant.
If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like “this (intuitively compelling) assumption is false” unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum. Which is to say, I find it self-evident that “agents” are exactly the sort of beings that can “want” things, because agency is about pursuing objectives and wanting is about the objectives that you pursue. If you don’t believe this then I don’t know what these words even mean for you.
Maybe I sort-of contain a lot of subagents, and ‘my values’ are the conjunction of my sub-agents’ values (where they don’t conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
Maybe, and maybe this means we need to treat “composite agents” explicitly in our models. But, there is also a case to be made that groups of (super)rational agents effectively converge into a single utility function, and if this is true, then the resulting system can just as well be interpreted as a single agent having this effective utility function, which is a solution that should satisfy the system of agents according to their existing bargaining equilibrium.
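As a toy sketch of the “effective single utility function” claim (the Nash bargaining solution and the numbers below are illustrative assumptions, not a claim about which bargaining solution the subagents actually use):

```python
import numpy as np
from itertools import product

# Two subagents, three pure outcomes; U[i, j] = utility of outcome i to subagent j.
U = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.4]])
d = np.array([0.0, 0.0])           # disagreement point

# Brute-force the Nash bargaining solution over lotteries on the outcomes.
best, best_p, best_u = -1.0, None, None
grid = np.linspace(0.0, 1.0, 201)
for p0, p1 in product(grid, grid):
    if p0 + p1 > 1.0:
        continue
    p = np.array([p0, p1, 1.0 - p0 - p1])
    u = p @ U                      # each subagent's expected utility
    gains = u - d
    if np.all(gains > 0) and gains.prod() > best:
        best, best_p, best_u = gains.prod(), p, u

# The bargaining outcome also maximizes a single weighted-sum utility, with
# weights inversely proportional to each subagent's gain at the solution;
# that weighted sum is the composite agent's "effective utility function".
w = 1.0 / (best_u - d)
print("lottery over outcomes:", best_p.round(3))
print("effective utility weights:", (w / w.sum()).round(3))
```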
Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.
If your agent converges to optimal behavior asymptotically, then I suspect it’s still going to have infinite g and therefore an asymptotically-crisply-defined utility function.
Right, but this doesn’t on its own help get that specific relatively-natural concept into the AGI’s goals, except insofar as it suggests “the correspondence between agents and goals” is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.
Of course it doesn’t help on its own. What I mean is, we are going to find a precise mathematical formalization of this concept and then hard-code this formalization into our AGI design.
If we go down that path then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to agree. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like “this (intuitively compelling) assumption is false” unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into vacuum.
Fair enough! I don’t think I agree in general, but I think ‘OK, but what’s your alternative to agency?’ is an especially good case for this heuristic.
Which is to say, I find it self-evident that “agents” are exactly the sort of beings that can “want” things, because agency is about pursuing objectives and wanting is about the objectives that you pursue.
The first counter-example that popped into my head was “a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less-pleasurable states”. This is a mind we should be able to build, even if it would never evolve naturally.
Possibly this still qualifies as an “agent” that “wants” and “pursues” things, as you conceive it, even though it doesn’t select actions?
My 0th approximation answer is: you’re describing something logically incoherent, like a p-zombie.
My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as “wants”, “experiences” et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the “relatively simple core structure that explains why complicated cognitive machines work”. The other referent is something in our specifically-human “ontological model” of the world (technically, I imagine that to be an infra-POMDP that all our hypotheses are refinements of). Since the latter is a “shard” of the former produced by evolution, the two referents are related, but might not be the same. (For example, I suspect that cats lack natural!consciousness but have human!consciousness.)
The creature you describe does not natural!want anything. You postulated that it is “experiencing more pleasurable and less pleasurable states”, but there is no natural method that would label its states as such, or that would interpret them as any sort of “experience”. On the other hand, maybe if this creature is designed as a derivative of the human brain, then it does human!want something, because our shard of the concept of “wanting” mislabels (relative to natural!want) weird states that wouldn’t occur in the ancestral environment.
You can then ask, why should we design the AI to follow what we natural!want rather than what we human!want? To answer this, notice that, under ideal conditions, you converge to actions that maximize your natural!want, (more or less) according to the definition of natural!want. In particular, under ideal conditions, you would build an AI that follows your natural!want. Hence, it makes sense to take a shortcut and “update now to the view you will predictably update to later”: namely, design the AI to follow your natural!want.