When I think about the strawberry problem, it seems unnatural, and perhaps misleading of our attention, since there’s no guarantee there’s even a reasonable solution.
Why would there not be a solution?

To clarify, I said there might not be a reasonable solution (i.e., one such that solving the strawberry problem isn't significantly harder than solving pivotal-act alignment).

Not directly answering your Q, but here's why it seems unnatural and maybe misleading-of-attention. Copied from a Slack message I sent:
First, I suspect that even an aligned AI would fail the “duplicate a strawberry and do nothing else” challenge, because such an AI would care about human life and/or about cooperating with humans, and it would be asked to stand by while 1.8 humans die each second and the world inches closer to doom via unaligned AI. (And so it also seems like it would need to lack a self-preservation drive.)
Since building an aligned AI is not sufficient for passing this challenge, this opens up the possibility that the strawberry problem is actually harder than the alignment task we need to solve to come out of AGI alive.
Two separate difficulties with this particular challenge:
I think “duplicate a single strawberry” is a very unnatural kind of terminal goal. Even if you had a deep understanding of the human motivational system, I think it would be very hard to raise a human whose primary value was the molecular duplication of a strawberry, such that this person had few other values to speak of. This difficulty is relevant because I think human values are grown via RL in the human brain (I have a big doc about this), and I think deep RL agents will have values grown in a similar fashion. I have a lot to say here to unpack these intuitions, but it'd take a lot longer than a paragraph. One quick argument: the only real-world intelligences we have ever seen do not seem like they could grow values like this, and I'm updating hard off of that empirical data. I do have more mechanistic reasoning to share, though, lest you think I'm inappropriately relying on analogies with humans.
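To gesture at the “values grown via RL” picture, here is a toy sketch of my own (not from the original discussion, and assuming a simple tabular setting that obviously doesn't capture brains or deep RL): a Q-learning agent on a small chain world with two reward sources ends up with contextual, plural values shaped by its whole reinforcement history, rather than a single designer-specified terminal goal.

```python
# Toy illustration only: tabular Q-learning on a 7-cell chain. Cell 0 and cell 6
# are two different rewarded "activities"; the values the agent ends up with
# reflect its whole reinforcement history, so the greedy policy chases whichever
# rewarded thing is contextually nearby rather than one terminal goal.
import random

N = 7                                 # states 0..6 on a chain
Q = [[0.0, 0.0] for _ in range(N)]    # actions: 0 = step left, 1 = step right
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.2

def env_step(s, a):
    """Move along the chain; the two ends give reward and end the episode."""
    s2 = max(0, min(N - 1, s + (1 if a == 1 else -1)))
    if s2 == 0:
        return s2, 1.0, True          # e.g. "drink water"
    if s2 == N - 1:
        return s2, 1.0, True          # e.g. "eat food"
    return s2, 0.0, False

for _ in range(5000):                 # epsilon-greedy Q-learning
    s, done = random.randint(1, N - 2), False
    while not done:
        a = random.randrange(2) if random.random() < EPS else (0 if Q[s][0] >= Q[s][1] else 1)
        s2, r, done = env_step(s, a)
        target = r if done else r + GAMMA * max(Q[s2])
        Q[s][a] += ALPHA * (target - Q[s][a])
        s = s2

# The greedy policy pursues whichever rewarded activity is contextually closer:
# roughly "left" (water) from the left half of the chain, "right" (food) from the right half.
print([("left" if Q[s][0] >= Q[s][1] else "right") for s in range(1, N - 1)])
```

Nothing here proves the strawberry-specific claim; it just illustrates the sense in which what an RL-trained agent ends up “caring about” falls out of which computations got reinforced, in which contexts.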
I think “do nothing else” is unnatural because it's like you're asking for a very intelligent mind which wants to do one thing but no other things. For all the ways I have imagined intelligence being trained out of randomly initialized noise, I imagine learned heuristics (e.g. if near the red exit of the maze, go in that direction) growing into contextually activated planning with heuristic evaluations (e.g. if I'm close to the exit as the crow flies / in L2 distance, try exhaustive search to depth 4, and back out to greedy heuristic search if that fails), such that agents reliably pull themselves into futures where certain things are true (e.g. being at the end of the maze, or having grown another strawberry). It seems like most of these goals should be “grabby” (e.g. about proactively solving more mazes, or growing even more strawberries); that is, agents will not stop steering themselves into certain kinds of futures after having made a single strawberry.
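To make that picture concrete, here is a minimal Python sketch of my own (an illustration, not anything from the original discussion) of “contextually activated planning with heuristic evaluations” in the maze example: a cheap greedy heuristic runs by default, and a depth-4 exhaustive search switches on only when the straight-line (L2) distance to the exit drops below a threshold. The names here are invented for the sketch; `walls` is assumed to be a set of blocked (x, y) cells, and `trigger_radius` is an arbitrary threshold.

```python
# Toy sketch: a learned greedy heuristic plus a contextually activated
# depth-limited planner, switched on by a simple "close to the exit" check.
from itertools import product
import math

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def l2(a, b):
    return math.dist(a, b)                      # straight-line ("as the crow flies") distance

def step(pos, move, walls):
    nxt = (pos[0] + MOVES[move][0], pos[1] + MOVES[move][1])
    return pos if nxt in walls else nxt         # bumping into a wall leaves you in place

def greedy_move(pos, exit_pos, walls):
    # Cheap heuristic: take the single move that most reduces L2 distance to the exit.
    return min(MOVES, key=lambda m: l2(step(pos, m, walls), exit_pos))

def exhaustive_plan(pos, exit_pos, walls, depth=4):
    # Contextually activated planner: brute-force every move sequence up to `depth` steps.
    for plan in product(MOVES, repeat=depth):
        cur = pos
        for i, m in enumerate(plan):
            cur = step(cur, m, walls)
            if cur == exit_pos:
                return list(plan[: i + 1])
    return None                                 # search failed

def choose_action(pos, exit_pos, walls, trigger_radius=4.0):
    if l2(pos, exit_pos) <= trigger_radius:     # context check: close to the exit in L2
        plan = exhaustive_plan(pos, exit_pos, walls)
        if plan:
            return plan[0]
    return greedy_move(pos, exit_pos, walls)    # back out to the greedy heuristic
```

Note that nothing in this decision rule encodes “exit exactly one maze and then stop”; the same contextual triggers fire whenever the agent finds itself near another exit, which is the sense in which the steering doesn't halt after a single success.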
But even if this particular picture is wrong, it seems hard to fathom that you can train an intelligent mind into sophistication, and then have it “stop on a dime” after doing a single task.
Even if that single task were “breed one new puppy into existence” (I think it's significantly easier to get a dog-producing AI), it sure seems to me like the contextually activated cognition which brought the first puppy into existence would activate again and find another plan to bring another puppy into existence, that the AI would model that if it didn't preserve itself it couldn't bring that next puppy into existence, and that this is a “default” in some way.
This was my first time trying to share my intuitions about this. Hopefully I managed to communicate something.