Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
I’m not sure this is true; or if it’s true, I’m not sure it’s relevant. But assuming it is true...
Therefore, if X is not entirely coherent then X’s preferences are only approximately defined, and hence we only need to infer them approximately.
… this strikes me as not capturing the aspect of human values that looks strange and complicated. Two ways I could imagine the strangeness and complexity cashing out as ‘EU-maximizer-ish’ are:
Maybe I sort-of contain a lot of subagents, and ‘my values’ are the conjunction of my sub-agents’ values (where they don’t conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.
In both cases, the fact that my brain isn’t a single coherent EU maximizer seemingly makes things a lot harder and more finicky, rather than making things easier. These are cases where you could say that my initial brain is ‘only approximately an agent’, and yet this comes with no implication that there’s any more room for error or imprecision than if I were an EU maximizer.
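A toy sketch of the second alternative (an invented illustration under made-up data, not something either party proposes): start from inconsistent pairwise preferences, add a single meta-preference (“whenever the preferences form a cycle, drop the weakest comparison in it”), and iterate. After a short chain of such self-modifications the relation is acyclic and a crude utility-like ranking can be read off. The specific preference data and repair rule below are assumptions of the example.

```python
# Invented toy data: pairwise preferences with intensities,
# as (preferred, dispreferred) -> strength. Note the cycle a > b > c > a.
prefs = {("a", "b"): 0.9, ("b", "c"): 0.6, ("c", "a"): 0.2,
         ("c", "d"): 0.8, ("a", "d"): 0.5}

def find_cycle(prefs):
    """Return the edges of some directed preference cycle, or None if acyclic."""
    graph = {}
    for (x, y) in prefs:
        graph.setdefault(x, []).append(y)

    def dfs(node, path):
        for nxt in graph.get(node, []):
            if nxt in path:                      # closing edge found: a cycle
                cycle_nodes = path[path.index(nxt):]
                return list(zip(cycle_nodes, cycle_nodes[1:] + [nxt]))
            found = dfs(nxt, path + [nxt])
            if found:
                return found
        return None

    for start in list(graph):
        found = dfs(start, [start])
        if found:
            return found
    return None

# The "chain of self-modifications": apply the meta-preference until coherent.
while (cycle := find_cycle(prefs)) is not None:
    weakest = min(cycle, key=lambda edge: prefs[edge])
    print("dropping weakest link in cycle:", weakest)
    del prefs[weakest]

# With the cycles repaired, "how many options does x beat, directly or
# transitively?" gives a utility-function-ish score that respects every
# remaining preference (acyclicity guarantees strictly higher scores upstream).
def beats(x, prefs, seen=None):
    seen = set() if seen is None else seen
    for (p, q) in prefs:
        if p == x and q not in seen:
            seen.add(q)
            beats(q, prefs, seen)
    return seen

options = {x for edge in prefs for x in edge}
utility = {x: len(beats(x, prefs)) for x in options}
print("repaired preferences:", prefs)
print("utility-ish scores:", utility)
```

This is far simpler than any realistic idealization procedure, but it shows the shape of the claim: the meta-preferences, not the base preferences, do the work of producing something coherent.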
I’m not saying that the specific goals humans have are natural: they are a complex accident of evolution. I’m saying that the general correspondence between agents and goals is natural.
Right, but this doesn’t on its own help get that specific relatively-natural concept into the AGI’s goals, except insofar as it suggests “the correspondence between agents and goals” is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.
Second, the only reason why the question “what X wants” can make sense at all, is because X is an agent. As a corollary, it only makes sense to the extent that X is an agent.
I’m not sure this is true; or if it’s true, I’m not sure it’s relevant.
If we go down that path, then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to reach agreement. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like “this (intuitively compelling) assumption is false” unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into a vacuum. Which is to say, I find it self-evident that “agents” are exactly the sort of beings that can “want” things, because agency is about pursuing objectives and wanting is about the objectives that you pursue. If you don’t believe this then I don’t know what these words even mean for you.
Maybe I sort-of contain a lot of subagents, and ‘my values’ are the conjunction of my sub-agents’ values (where they don’t conflict), plus the output of an idealized negotiation between my sub-agents (where they do conflict).
Maybe, and maybe this means we need to treat “composite agents” explicitly in our models. But there is also a case to be made that groups of (super)rational agents effectively converge on a single utility function, and if this is true, then the resulting system can just as well be interpreted as a single agent with that effective utility function: a solution that should satisfy the component agents, given their existing bargaining equilibrium.
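To make that convergence claim a bit more tangible, here is a minimal sketch with invented toy numbers (the utilities, the disagreement point, and the choice of Nash bargaining as the aggregation rule are all assumptions of the example, not claims from this discussion). Two subagents bargain over lotteries on four options; the bargained outcome is exactly what a single agent maximizing one fixed linear “effective” utility function would choose, which is the sense in which the composite system can be read as one agent.

```python
import numpy as np
from scipy.optimize import minimize

# Invented toy numbers: two subagents with linear utilities over lotteries on
# four options, aggregated by the Nash bargaining solution.
U = np.array([[1.0, 0.7, 0.4, 0.0],    # subagent 1's utility for each option
              [0.0, 0.5, 0.8, 1.0]])   # subagent 2's utility for each option
d = np.array([0.1, 0.1])               # disagreement ("no deal") payoffs

def neg_log_nash_product(p):
    gains = U @ p - d
    if np.any(gains <= 0):
        return 1e6                     # keep the search in the feasible region
    return -np.sum(np.log(gains))      # minimizing this maximizes the Nash product

n = U.shape[1]
res = minimize(neg_log_nash_product, np.full(n, 1.0 / n),
               bounds=[(0.0, 1.0)] * n,
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}])
p_star = res.x                         # the bargained lottery
gains_star = U @ p_star - d

# Effective utility: a weighted sum of the subagents' utilities with weights
# fixed by the bargaining outcome. It is linear (vNM-style), and the bargained
# lottery puts weight only on options that maximize it.
weights = 1.0 / gains_star
effective_u = weights @ (U - d[:, None])

print("bargained lottery over options:", np.round(p_star, 3))
print("effective utility of each option:", np.round(effective_u, 3))
```

Nothing hinges on there being exactly two subagents; with more of them the same weighted-sum reading falls out of the bargaining solution, with one weight per subagent.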
Alternatively, maybe I have a bunch of inconsistent preferences, but I have a complicated pile of meta-preferences that collectively imply some chain of self-modifications and idealizations that end up producing something more coherent and utility-function-ish after a long sequence of steps.
If your agent converges to optimal behavior asymptotically, then I suspect it’s still going to have infinite g and therefore an asymptotically-crisply-defined utility function.
Right, but this doesn’t on its own help get that specific relatively-natural concept into the AGI’s goals, except insofar as it suggests “the correspondence between agents and goals” is a simple concept, and any given simple concept is likelier to pop up in a goal than a more complex one.
Of course it doesn’t help on its own. What I mean is, we are going to find a precise mathematical formalization of this concept and then hard-code this formalization into our AGI design.
If we go down that path, then it becomes the sort of conversation where I have no idea what common assumptions we have, if any, that we could use to reach agreement. As a general rule, I find it unconstructive, for the purpose of trying to agree on anything, to say things like “this (intuitively compelling) assumption is false” unless you also provide a concrete argument or an alternative of your own. Otherwise the discussion is just ejected into a vacuum.
Fair enough! I don’t think I agree in general, but I think ‘OK, but what’s your alternative to agency?’ is an especially good case for this heuristic.
Which is to say, I find it self-evident that “agents” are exactly the sort of beings that can “want” things, because agency is about pursuing objectives and wanting is about the objectives that you pursue.
The first counter-example that popped into my head was “a mind that lacks any machinery for considering, evaluating, or selecting actions; but it does have machinery for experiencing more-pleasurable vs. less-pleasurable states”. This is a mind we should be able to build, even if it would never evolve naturally.
Possibly this still qualifies as an “agent” that “wants” and “pursues” things, as you conceive it, even though it doesn’t select actions?
My 0th approximation answer is: you’re describing something logically incoherent, like a p-zombie.
My 1st approximation answer is more nuanced. Words that, in the pre-Turing era, referred exclusively to humans (and sometimes animals, and fictional beings), such as “wants”, “experiences” et cetera, might have two different referents. One referent is a natural concept, something tied into deep truths about how the universe (or multiverse) works. In particular, deep truths about the “relatively simple core structure that explains why complicated cognitive machines work”. The other referent is something in our specifically-human “ontological model” of the world (technically, I imagine that to be an infra-POMDP that all our hypotheses are refinements of). Since the latter is a “shard” of the former produced by evolution, the two referents are related, but might not be the same. (For example, I suspect that cats lack natural!consciousness but have human!consciousness.)
The creature you describe does not natural!want anything. You postulated that it is “experiencing more pleasurable and less pleasurable states”, but there is no natural method that would label its states as such, or that would interpret them as any sort of “experience”. On the other hand, maybe if this creature is designed as a derivative of the human brain, then it does human!want something, because our shard of the concept of “wanting” mislabels (relative to natural!want) weird states that wouldn’t occur in the ancestral environment.
You can then ask, why should we design the AI to follow what we natural!want rather than what we human!want? To answer this, notice that, under ideal conditions, you converge to actions that maximize your natural!want, (more or less) according to the definition of natural!want. In particular, under ideal conditions, you would build an AI that follows your natural!want. Hence, it makes sense to take a shortcut and “update now to the view you will predictably update to later”: namely, design the AI to follow your natural!want.