TurnTrout comments on A central AI alignment problem: capabilities generalization, and the sharp left turn

TurnTrout 30 Jun 2022 1:38 UTC
LW: 19 AF: 8
10
AF
These two problems appear in the strawberry problem, which Eliezer’s been pointing at for quite some time: the problem of getting an AI to place two identical (down to the cellular but not molecular level) strawberries on a plate, and then do nothing else. The demand of cellular-level copying forces the AI to be capable; the fact that we can get it to duplicate a strawberry instead of doing some other thing demonstrates our ability to direct it; the fact that it does nothing else indicates that it’s corrigible (or really well aligned to a delicate human intuitive notion of inaction).
Let T be the target objective we wish to align the system towards. In this paragraph, T is duplicating the strawberry (and doing nothing else). I seriously doubt that alignment is comparably difficult for all objectives T, or even all “reasonable” objectives. I think building an AGI which solves the strawberry problem is far harder than building an AGI which makes lots of dogs in the future. I think that a diamond maximizer is also harder to train than an AGI which makes lots of dogs, but an AGI which makes lots of diamonds is probably only a little harder than an AGI which makes lots of dogs.
(I haven’t explained why I hold these beliefs, but I figured it would be productive to at least note this axis of disagreement.)
When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there’s no guarantee there’s even a reasonable solution. We’d probably be better off thinking about how to make AGIs which care about dogs, because it’s an empirical fact that there’s a way to align intelligences to that objective.
- Rob Bensinger 30 Jun 2022 19:15 UTC
  LW: 11 AF: 7
  −1
  AF Parent
  When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there’s no guarantee there’s even a reasonable solution.
  Why would there not be a solution?
  - TurnTrout 3 Jul 2022 2:03 UTC
    LW: 14 AF: 7
    7
    AF Parent
    To clarify, I said there might not be a reasonable solution (i.e. such that solving the strawberry problem isn’t significantly harder than solving pivotal-act alignment).
    Not directly answering your Q, but here’s why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent:
    First, I suspect that even an aligned AI would fail the “duplicate a strawberry and do nothing else” challenge, because such an AI would care about human life and/or about cooperating with humans, and would be asked to stand by while 1.8 humans die each second and the world inches closer to doom via unaligned AI. (And so it also seems to need to lack a self-preservation drive.)
    
    Since building an aligned AI is not a sufficient condition, this opens the possibility that the strawberry problem is actually harder than the alignment task we need to solve to come out of AGI alive.
    Two separate difficulties with this particular challenge:
    I think “duplicate a single strawberry” is a very unnatural kind of terminal goal. I think it would be very hard to raise a human whose primary value was the molecular duplication of a strawberry, such that this person had few other values to speak of, even if you had deep understanding of the human motivational system. I think this difficulty is relevant because I think human values are grown via RL in the human brain (I have a big doc about this), and I think deep RL agents will have values grown in a similar fashion. I have a lot to say here to unpack these intuitions, but it’d take a lot longer than a paragraph. Maybe one quick argument to make is that the only real-world intelligences we have ever seen do not seem like they could grow values like this, and I’m updating hard off of these empirical data. I do have more mechanistic reasoning to share, though, lest you think I’m inappropriately relying on analogies with humans.
    I think “do nothing else” is unnatural because it’s like you’re asking for a very intelligent mind which wants to do one thing, but no other things. For all the ways I have imagined intelligence being trained out of randomly initialized noise, I imagine learned heuristics (e.g. if near the red-exit of the maze, go in that direction) grow into contextually activated planning with heuristic evaluations (e.g. if I’m close to the exit as the crow flies / in L2, then try exhaustive search with depth 4, and back out to greedy heuristic search if that fails), such that agents reliably pull themselves into futures where certain things are true (e.g. being at the end of the maze, or having grown another strawberry). It seems like most of these goals should be “grabby” (e.g. about proactively solving more mazes, or growing even more strawberries). That is, agents will not stop steering themselves into certain kinds of futures after having made a single strawberry.
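To make the maze example concrete, here is a minimal Python sketch of "contextually activated planning". It is only an illustration under stated assumptions: the grid encoding, the function names (`choose_move`, `exhaustive_search`, `greedy_step`) and the `radius` threshold are hypothetical and do not come from the original message. Only the depth-4 search and the greedy fallback are taken from the text.

```python
import math

# Hypothetical grid maze: '#' is a wall, anything else is open floor.
MOVES = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def open_neighbors(maze, pos):
    """Return the open cells directly adjacent to pos."""
    r, c = pos
    result = []
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(maze) and 0 <= nc < len(maze[0]) and maze[nr][nc] != "#":
            result.append((nr, nc))
    return result

def l2(a, b):
    """Straight-line ("as the crow flies") distance between two cells."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def exhaustive_search(maze, pos, goal, depth, visited=None):
    """Depth-limited exhaustive search; returns a path to goal or None."""
    if pos == goal:
        return [pos]
    if depth == 0:
        return None
    visited = (visited or set()) | {pos}
    for nxt in open_neighbors(maze, pos):
        if nxt in visited:
            continue
        rest = exhaustive_search(maze, nxt, goal, depth - 1, visited)
        if rest is not None:
            return [pos] + rest
    return None

def greedy_step(maze, pos, goal):
    """Cheap heuristic: step to the open neighbor closest to the goal in L2."""
    options = open_neighbors(maze, pos)
    return min(options, key=lambda p: l2(p, goal)) if options else pos

def choose_move(maze, pos, goal, radius=3.0, depth=4):
    """Run the expensive search only when the context (near the exit) triggers it."""
    if l2(pos, goal) <= radius:
        path = exhaustive_search(maze, pos, goal, depth)
        if path is not None and len(path) > 1:
            return path[1], "search"
    # Far from the exit, or the bounded search failed: back out to the heuristic.
    return greedy_step(maze, pos, goal), "greedy"
```

Note that nothing in `choose_move` makes it stop: called again in a new maze, the same context-triggered machinery fires again, which is the "grabby" intuition in miniature.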
    But even if this particular picture is wrong, it seems hard to fathom that you can train an intelligent mind into sophistication, and then have it “stop on a dime” after doing a single task.
    Even if that single task were “breed one new puppy into existence” (I think it’s significantly easier to get a dog-producing AI), it sure seems to me like the contextually activated cognition which brought the first puppy into existence would again activate and find another plan to bring another puppy into existence, and that the AI would model that if it didn’t preserve itself, it couldn’t bring that next puppy into existence, and that this is a “default” in some way.
    This was my first time trying to share my intuitions about this. Hopefully I managed to communicate something.
    What links here?
    Inner and outer alignment decompose one hard problem into two extremely hard problems by TurnTrout (2 Dec 2022 2:43 UTC; 147 points)
- Lone Pine 30 Jun 2022 10:56 UTC
  4 points
  1
  Parent
  The whole strawberry thing is really confusing to me; it just doesn’t map to any natural problem that humans actually care about. And when EY says “and nothing else,” it’s not clear what the actual boundaries to impact are. In order to create a strawberry, the AI must modify other things about the world. At the very least, you have to get the atoms for the strawberry from somewhere. And since the requirement is for the two strawberries to be “identical on the cellular level” (which IMO is not a scientifically grounded concept), the AI would presumably have to invent some advanced technology, presumably nanotech, which requires modifying the world too. Even if the AI does nanotech R&D entirely in simulation, there is still some limited impact (due to energy use of the computers, as well as the need to print DNA or whatever to manifest the nanotech in reality, etc.).
  
  I think the test should be the Wozniak test: a simple robot enters a random home and uses the tools available to make coffee. That’s a much more sensible test: we can easily verify whether there is any impact outside of the home, and if EY is right, it should be just as difficult to do consistently, given the difficulty of alignment.
  - Rob Bensinger 30 Jun 2022 19:25 UTC
    7 points
    2
    Parent
    The whole strawberry thing is really confusing to me; it just doesn’t map to any natural problem that humans actually care about.
    It maps to the pivotal acts I think are most promising; the order of difficulty seems similar to me, and the kind of problem seems similar to me too.
    And when EY says “and nothing else,” it’s not clear what the actual boundaries to impact are.
    I think EY mainly means ‘without the AI steering or directly impacting the macro-state of the world’. The task obviously isn’t possible if you literally can’t affect any atoms in the universe outside the strawberries themselves. But IMO it would work fine for this challenge if the humans put a bunch of resources into a room (or into a football stadium, if needed), and got the AI to execute the task without having a direct large impact on the macro-state of the world (nor an indirect impact that involves the AI deliberately steering the world toward a new macro-state in some way).
    I think the test should be the Wozniak test: a simple robot enters a random home and uses the tools available to make coffee. [...] EY is right, it should be just as difficult to do consistently, given the difficulty of alignment.
    This seems much easier to do right, because (a) the robot can get by with being dramatically less smart, and (b) the task itself is extremely easy for humans to understand, oversee, and verify. (Indeed, the task is so simple that in real life you could just have a human hook up a camera to the robot and steer the robot by remote control.)
    For the Wozniak test, the capabilities of the system can be dramatically weaker, and the alignment is dramatically easier. This doesn’t obviously capture the things I see as hard about alignment.