One observation that comes to mind is that the end of games for very good players tends to be extremely simple. A Go game by a pro crushing the other player doesn’t end in a complicated board which looks like the Mona Lisa; it looks like a boring regular grid of black stones dotted with 2 or 3 voids. Or if we look at chess endgame databases, which are provably optimal and perfect play, we don’t find all the beautiful concepts of chess tactics and strategy that we love to analyze—we just find mysterious, bafflingly arbitrary moves which make no sense, which continue to make no sense when we think about them, and which have no justification other than “when we brute force every possibility, this is what we get”, but which, nevertheless, happen to be perfect for winning. In reinforcement learning, the overall geometry of ‘strategy space’ has been described as looking like a diamond: early on, with poor players, there are few coherent strategies; medium-strength players can enjoy a wide variety of interestingly-distinct diverse strategies; but then as they approach perfection, strategy space collapses down to the Nash equilibrium. (If there is only one Nash equilibrium, well, that’s pretty depressingly boring; if there is more than one, many of them may just never get learned because there is by definition no need to learn them and they can’t be invaded, and even if they do get learned, there will still probably be many fewer of them than the suboptimal strategies played earlier on.) So, in the domains where we can approach perfection, the idea that there will always be large amounts of diversity and interesting behaviors does not seem to be doing well.
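To make the collapse-at-perfection point concrete, here is a minimal sketch (my own toy illustration, not anything referenced in the thread): a memoized negamax solver for tic-tac-toe. Once the game is solved exactly, every position has a single game-theoretic value, and perfect play from the empty board is a foregone conclusion (a draw), no matter which optimal line gets played.

```python
# Toy illustration: brute-force tic-tac-toe with memoized negamax.
# Perfect play collapses the whole game tree to one predetermined value.
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value with `player` to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    if ' ' not in board:
        return 0
    opponent = 'O' if player == 'X' else 'X'
    return max(-value(board[:i] + player + board[i + 1:], opponent)
               for i, cell in enumerate(board) if cell == ' ')

if __name__ == '__main__':
    empty = ' ' * 9
    print("value under perfect play:", value(empty, 'X'))  # 0: always a draw
    # Every one of the 9 openings preserves that value; the 'strategies' differ,
    # but they all funnel into the same drawn outcome.
    print("openings consistent with perfect play:",
          [i for i in range(9)
           if -value(empty[:i] + 'X' + empty[i + 1:], 'O') == 0])
```

Chess endgame tablebases are the same picture at vastly larger scale: a lookup table of values, with nothing left to argue about.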
Undecidable problems being undecidable doesn’t really help much. After all, you provably can’t solve them in general, and how often will any finite decidable instance come up in practice? How often does it come up after being made to not come up? Just because a problem exists doesn’t mean it’s worth caring about or solving. There are many ways to get around, or simply ignore, problems like impossibility proofs or no-go theorems or bad asymptotics. (You can easily see how a lot of my observations about computational complexity ‘proving AI impossible’ would apply to any claim that a paperclipper has to solve the Halting Problem or something.)
I suspect that a paperclip maximizer would look less like perfect Go play and more like a TAS speedrun of Mario. Different people have different ideas of interesting, but I personally find TASes fun to watch.
The much longer version of this argument is here.
Yeah, I realized after I wrote it that I should’ve brought in speedrunning and related topics even if they are low-status compared to Go/chess and formal reinforcement learning research.
I disagree that they are all that interesting: a lot of TASes don’t look like “amazing skilled performance that brings you to tears to watch” but “the player stands in place twitching for 32.1 seconds and then teleports to the YOU WIN screen”.* (Which is why regular games need to constantly patch to keep the meta alive and not collapse into cheese or a Nash equilibrium or cycle.) Even the ones not quite that broken are still deeply dissatisfying to watch; one that’s closely analogous to the chess endgame databases and doesn’t involve ‘magic’ is this bruteforce of Arkanoid’s game tree—the work that goes into solving the MDP efficiently is amazing and fascinating, but to watch the actual gameplay is to look into an existential void of superintelligence without comprehension or meaning (never mind beauty).
The process of developing or explaining a speedrun can be interesting, like that Arkanoid example—but only once. And then you have all the quadrillions of repetitions afterwards executing the same optimal policy. Because the game can’t change, the optimal policy can’t either. There is no diversity or change or fun. Only perfection.
(Which is where I disagree with “The Last Paperclip”; the idea of A and D being in an eternal stasis is improbable: the equilibrium or stasis would shatter almost immediately, perfection would be reached, and then all the subsequent trillions of years would just be paperclipping. In the real world, there’s no deity which can go “oh, that nanobot is broken, we’d better nerf it”. Everything becomes a trilobite.)
EDIT: another example is how this happens to games like Tom Ray’s Tierra or Core Wars or the Prisoner’s Dilemma tournaments here on LW: under any kind of resource constraint, the best agent is typically some extremely simple fast replicator or attacker which can tear through enemies faster than they can react, neither knowing nor caring about exactly what flavor of enemy-of-the-week it is chewing up and digesting. Think Indiana Jones and the sword guy. (Analogies to infectious diseases and humans left as an exercise for the reader...) Intelligence and flexibility are very expensive, and below a certain point, pretty lousy tools which only just barely pay their way in a few ecological niches. It requires intervention and design and slack to enable any kind of complex strategies to evolve. If someone shows you some DRL research like AI-GAs where agents rapidly evolve greater intelligence, this only works at all because the brains are ‘outside’ the simulation and thinking is free. If those little agents in a, say, DeepMind soccer simulation had to pay for all their thinking, they’d never get past a logistic regression in complexity. Similarly, one asteroid here or there, and an alien flying into the Solar System would conclude that viruses & parasites really are the ultimate and perfect life forms in terms of reproductive fitness in playing the game of life. (And beetles.)
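As a throwaway toy model of the “thinking is not free” point (my own sketch with made-up costs, not any of the actual Tierra/Core Wars/LW tournaments): a round-robin iterated Prisoner’s Dilemma where every decision is taxed in proportion to how much deliberation the strategy does. The expensive ‘opponent modeler’ here earns the same raw payoffs as plain tit-for-tat, but once the thinking tax is deducted, the cheap rule wins.

```python
# Toy round-robin iterated Prisoner's Dilemma with a per-decision 'thinking' tax.
# Costs are invented for illustration; the point is only that deliberation
# isn't free once the agent has to pay for its own cognition.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def always_defect(my_hist, their_hist):
    return 'D', 0.0            # zero deliberation cost

def tit_for_tat(my_hist, their_hist):
    move = their_hist[-1] if their_hist else 'C'
    return move, 0.05          # tiny cost: remembers exactly one move

def deep_modeler(my_hist, their_hist):
    # Stand-in for an expensive opponent model re-run every round.
    move = 'D' if their_hist.count('D') > their_hist.count('C') else 'C'
    return move, 0.5           # heavy per-round thinking cost

def play(a, b, rounds=200):
    ha, hb, score_a, score_b = [], [], 0.0, 0.0
    for _ in range(rounds):
        (ma, ca), (mb, cb) = a(ha, hb), b(hb, ha)
        pa, pb = PAYOFF[(ma, mb)]
        score_a += pa - ca
        score_b += pb - cb
        ha.append(ma)
        hb.append(mb)
    return score_a, score_b

if __name__ == '__main__':
    strategies = [always_defect, tit_for_tat, deep_modeler]
    totals = {s.__name__: 0.0 for s in strategies}
    for a in strategies:
        for b in strategies:
            if a is not b:
                sa, sb = play(a, b)
                totals[a.__name__] += sa
                totals[b.__name__] += sb
    for name, score in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{name:15s} {score:8.1f}")
```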
* An example: the hottest game of the moment, a critical darling for its quality, by a team that has implemented many highly-successful open-world 3D games before, is Elden Ring, designed to give even a master player hours of challenges. Nevertheless, you can beat it in <7 minutes by not much more than running through a few doors and twitching in place. (The twitching accelerates you at ultra-velocity ‘through’ the game and when you launch & land just right it kills the bosses, somehow. It will doubtless be improved over time.)
I fully concede that a Paperclip Maximizer is way less interesting if there turns out to be some kind of false vacuum that allows you to just turn the universe into a densely tiled space filled with paperclips expanding at the speed of light.
It would be cool to make a classification of games where perfect play is interesting (Busy Beaver Game, Mao, Calvinball) vs. games where it is boring (Tic-Tac-Toe, Checkers). I suspect that since Go is merely EXPTIME-complete (not Turing-complete) it falls in the second category. But it’s possible that e.g. optimal Go play involves a mixed-strategy Nash equilibrium drawing on an infinite set of strategies with ever-decreasing probability.
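As a side note on what a mixed-strategy equilibrium concretely looks like in the simplest case (my own illustration using scipy, nothing specific to Go): a finite zero-sum matrix game can be solved as a linear program. Rock-paper-scissors has a unique equilibrium and it is maximally boring: play each move with probability 1/3, for a game value of 0. The speculation above is whether Go’s equilibrium could be qualitatively richer than that.

```python
# Solving a zero-sum matrix game (rock-paper-scissors) as a linear program.
import numpy as np
from scipy.optimize import linprog

# Row player's payoffs; rows/columns are rock, paper, scissors.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)
n = A.shape[0]

# Variables: n mixing probabilities x plus the game value v; maximize v.
c = np.concatenate([np.zeros(n), [-1.0]])          # linprog minimizes, so use -v
# Against every opponent column j: v - sum_i x_i * A[i, j] <= 0.
A_ub = np.hstack([-A.T, np.ones((A.shape[1], 1))])
b_ub = np.zeros(A.shape[1])
A_eq = np.array([np.concatenate([np.ones(n), [0.0]])])   # probabilities sum to 1
b_eq = np.array([1.0])
bounds = [(0, None)] * n + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
x, v = res.x[:n], res.x[n]
print("equilibrium mix:", np.round(x, 3))   # ~[0.333 0.333 0.333]
print("game value:", round(v, 3))           # 0.0
```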
Problem left for the reader: prove the existence of a game which is not Turing-complete but where optimal play requires an infinite number of strategies, such that no computable algorithm outputs all of these strategies.
I did cheat in the story by giving D a head start (so it could eternally outrun A by fleeing away at 0.99c). However, in general this depends on how common intelligent life is elsewhere in the universe. If the majority of A’s future light-cone is filled with non-paperclipping intelligent beings (and there is no false-vacuum or similar “hack”), then I think A has to remain intelligent.