There is also another way that super-intelligent AI could be aligned by definition: namely, if your utility function isn’t “humans survive” but instead “I want the future to be filled with interesting stuff”. For all the hand-wringing about paperclip maximizers, the fact remains that any AI capable of colonizing the universe will probably be pretty cool/interesting. Humans don’t create poetry/music/art just because we’re bored; we do it because expressing our creativity helps us think better. It’s probably much harder to build an AI that wipes out all humans and then colonizes space while also being super-boring than to make one that does those things in a way that people who fantasize about giant robots would find cool.
I’m not convinced that (the world with) a superintelligent AI would probably be pretty cool/interesting. Does anyone know of a post/paper/(sci-fi) book/video/etc. that discusses this? (I know there’s this :P and maybe this.) Perhaps let’s discuss it here! I guess the answer depends on how human-centered/inspired (not quite the right term, but I couldn’t come up with a better one) our notion of interestingness is. It would be cool to have a plot of the expected interestingness of the first superintelligence (or rather, something richer than a single expectation, but you get the idea) as a function of how human-centered the notion of “interestingness” is. Figuring this out in detail would be complicated, but it nevertheless seems likely that something interesting could be said about it.
I think we (at least also) create poetry/music/art because of godshatter. To what extent should we expect AI to godshatter, vs do something like spending 5 minutes finding one way to optimally turn everything into paperclips and doing that for all eternity? The latter seems pretty boring. Or idk, maybe the “one way” is really an exciting enough assortment of methods that it’s still pretty interesting even if it’s repeated for all eternity?
On the one hand, your definition of “cool and interesting” may be different from mine, so it’s entirely possible I would find a paperclip maximizer cool but you wouldn’t. As a mathematician I find a lot of things interesting that most people hate (this is basically a description of all of math).
On the other hand, I really don’t buy many of the arguments in “value is fragile”. For example:
And you might be able to see how the vast majority of possible expected utility maximizers, would only engage in just so much efficient exploration, and spend most of its time exploiting the best alternative found so far, over and over and over.
I simply disagree with this claim. The coast guard and fruit flies both use Lévy flights because they are mathematically optimal. Boredom isn’t some special feature of human beings; it is an approximation to the best possible algorithm for solving the exploration problem. Super-intelligent AI will have a better approximation, and therefore better boredom.
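To make the exploration claim concrete, here is a minimal toy of my own (not from the coast-guard or fruit-fly studies; the exponent, step budget, and function names are arbitrary choices): a 2D walker whose step lengths are drawn from a heavy-tailed Pareto distribution covers far more distinct ground than one taking fixed-length Brownian-style steps.

```python
# Illustrative toy only: Lévy-flight vs. Brownian-style search.
# Heavy-tailed step lengths let the walker land in far more distinct
# cells for the same step budget, which is the sense in which Lévy
# flights are a near-optimal exploration strategy.
import numpy as np

rng = np.random.default_rng(0)

def distinct_cells(n_steps, levy=False, alpha=1.5):
    """Count the distinct integer cells a 2D random walker lands in."""
    pos = np.zeros(2)
    visited = {(0, 0)}
    for _ in range(n_steps):
        angle = rng.uniform(0, 2 * np.pi)
        # Brownian-style: unit steps; Lévy: Pareto (power-law) lengths.
        length = 1.0 + rng.pareto(alpha) if levy else 1.0
        pos += length * np.array([np.cos(angle), np.sin(angle)])
        visited.add((int(pos[0]), int(pos[1])))
    return len(visited)

n = 10_000
print("Brownian:", distinct_cells(n))             # stays near the origin
print("Lévy:    ", distinct_cells(n, levy=True))  # ranges much further
```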
EY seems to also be worried that super-intelligent AI might not have qualia, but my understanding of his theory of consciousness is that “has qualia” is synonymous with “reasons about coalitions of coalitions”, so I’m not sure how an agent can be good at that and not have qualia.
The most defensible version of “paperclip maximizers are boring” would be something like this video. But unlike MMOs, I don’t think there is a single “meta” that solves the universe (even if all you care about is paperclips). Take a look at this list of undecidable problems and consider whether any of them might possibly be relevant to filling the universe with paperclips. If they are, then an optimal paperclip maximizer has an infinite set of interesting math problems to solve in its future.
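To make “an infinite set of interesting math problems” concrete, here is a toy semi-decider for the Post Correspondence Problem, a classic entry on those lists of undecidable problems (the instance below is a standard textbook example; the code and names are my own sketch). Search finds a solution whenever one exists, but no algorithm halts on every instance, so there is no single “meta” that closes the problem out.

```python
# Toy semi-decision procedure for the Post Correspondence Problem (PCP),
# undecidable in general: BFS over the "overhang" between the top and
# bottom concatenations. It halts with a solution when one exists (given
# enough budget), but no algorithm terminates on all instances.
from collections import deque

def pcp_search(pairs, max_nodes=100_000):
    """Find a tile sequence whose top/bottom concatenations agree, if any."""
    queue, seen = deque(), set()
    for i, (t, b) in enumerate(pairs):
        if t.startswith(b):
            queue.append((t[len(b):], "top", [i]))
        elif b.startswith(t):
            queue.append((b[len(t):], "bot", [i]))
    nodes = 0
    while queue and nodes < max_nodes:
        overhang, side, seq = queue.popleft()
        nodes += 1
        if overhang == "":
            return seq  # top and bottom strings match exactly: solved
        if (overhang, side) in seen:
            continue
        seen.add((overhang, side))
        for i, (t, b) in enumerate(pairs):
            # Append tile i; the longer side carries the current overhang.
            top = overhang + t if side == "top" else t
            bot = overhang + b if side == "bot" else b
            if top.startswith(bot):
                queue.append((top[len(bot):], "top", seq + [i]))
            elif bot.startswith(top):
                queue.append((bot[len(top):], "bot", seq + [i]))
    return None  # nothing found within the budget (maybe nothing exists)

tiles = [("a", "baa"), ("ab", "aa"), ("bba", "bb")]  # textbook instance
print(pcp_search(tiles))  # [2, 1, 2, 0]: "bbaabbbaa" on top and bottom
```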
One observation that comes to mind is that endgames between very good players tend to be extremely simple. A Go game in which a pro crushes the other player doesn’t end in a complicated board which looks like the Mona Lisa; it looks like a boring regular grid of black stones dotted with 2 or 3 voids. Or if we look at chess endgame databases, which are provably optimal, perfect play, we don’t find all the beautiful concepts of chess tactics and strategy that we love to analyze; we just find mysterious, bafflingly arbitrary moves which make no sense, which continue to make no sense when we think about them, and which have no justification other than “when we brute-force every possibility, this is what we get”, but which nevertheless happen to be perfect for winning. In reinforcement learning, the overall geometry of ‘strategy space’ has been described as looking like a diamond: early on, with poor players, there are few coherent strategies; medium-strength players can enjoy a wide variety of interestingly-distinct, diverse strategies; but then, as they approach perfection, strategy space collapses down to the Nash equilibrium. (If there is only one Nash equilibrium, well, that’s pretty depressingly boring; if there is more than one, many of them may just never get learned, because there is by definition no need to learn them and they can’t be invaded, and even if they do get learned, there will still probably be many fewer of them than the suboptimal strategies played earlier on.) So, in the domains where we can approach perfection, the idea that there will always be large amounts of diversity and interesting behaviors does not seem to be doing well.
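(To see the collapse in the simplest possible case, here is a sketch of solving a zero-sum matrix game for its equilibrium by linear programming; rock-paper-scissors is my stand-in example, and the function name and setup are my own. The unique equilibrium is the maximally bland uniform mixture: all the “interesting” strategies live below perfection.)

```python
# Sketch: compute the Nash equilibrium of a zero-sum matrix game via LP.
# For rock-paper-scissors the unique equilibrium is the uniform mixture.
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A):
    """Maximin mixed strategy for the row player of payoff matrix A."""
    n_rows, n_cols = A.shape
    # Variables: x (row mixture) and v (game value). Maximize v = min -v.
    c = np.concatenate([np.zeros(n_rows), [-1.0]])
    # For every column j: sum_i A[i,j]*x_i >= v,  i.e.  -A^T x + v <= 0.
    A_ub = np.hstack([-A.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    A_eq = np.concatenate([np.ones(n_rows), [0.0]]).reshape(1, -1)
    b_eq = [1.0]  # the mixture must sum to 1
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_rows], res.x[n_rows]

rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # rock, paper, scissors
mixture, value = solve_zero_sum(rps)
print(mixture.round(3), round(value, 3))  # [0.333 0.333 0.333] and value 0
```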
Undecidable problems being undecidable doesn’t really help much. After all, you provably can’t solve them in general, and how often will any finite decidable instance come up in practice? How often does it come up after being made to not come up? Just because a problem exists doesn’t mean it’s worth caring about or solving. There are many ways of getting around, or simply ignoring, problems like impossibility proofs, no-go theorems, or bad asymptotics. (You can easily see how a lot of my observations about computational complexity ‘proving AI impossible’ would apply to any claim that a paperclipper has to solve the Halting Problem or something.)
So, in the domains where we can approach perfection, the idea that there will always be large amounts of diversity and interesting behaviors does not seem to be doing well.
I suspect that a paperclip maximizer would look less like perfect Go play and more like a TAS speedrun of Mario. Different people have different ideas of interesting, but I personally find TASes fun to watch.

The much longer version of this argument is here.
Yeah, I realized after I wrote it that I should’ve brought in speedrunning and related topics even if they are low-status compared to Go/chess and formal reinforcement learning research.
I disagree that they are all that interesting: a lot of TASes don’t look like “amazing skilled performance that brings you to tears to watch” but “the player stands in place twitching for 32.1 seconds and then teleports to the YOU WIN screen”.* (Which is why regular games need to constantly patch to keep the meta alive and not collapse into cheese, a Nash equilibrium, or a cycle.) Even the ones not quite that broken are still deeply dissatisfying to watch; one that’s closely analogous to the chess endgame databases and doesn’t involve ‘magic’ is this brute-force of Arkanoid’s game tree: the work that goes into solving the MDP efficiently is amazing and fascinating, but watching the actual gameplay is to look into an existential void of superintelligence without comprehension or meaning (never mind beauty).
The process of developing or explaining a speedrun can be interesting, like that Arkanoid example, but only once. And then you have all the quadrillions of repetitions afterwards, executing the same optimal policy. The game can’t change, so the optimal policy can’t either. There is no diversity or change or fun. Only perfection.
(Which is where I disagree with “The Last Paperclip”: the idea of A and D being in an eternal stasis is improbable. The equilibrium or stasis would shatter almost immediately, perfection would be reached, and then all the subsequent trillions of years would just be paperclipping. In the real world, there’s no deity which can go “oh, that nanobot is broken, we’d better nerf it”. Everything becomes a trilobite.)
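As a toy illustration of “the game can’t change, so the optimal policy can’t either” (my own sketch; the MDP is a random example, nothing to do with the Arkanoid solver): value iteration on any fixed MDP converges to a static lookup table, and every playthrough after that is identical.

```python
# Value iteration on a small fixed MDP. Once it converges, the greedy
# policy is just a static table; every "playthrough" thereafter repeats
# it exactly. The MDP itself is an arbitrary illustrative example.
import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(42)
# Deterministic transitions and rewards, fixed once like a game's rules.
T = rng.integers(0, n_states, size=(n_states, n_actions))
R = rng.random((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(1000):            # iterate to (near) convergence
    Q = R + gamma * V[T]         # Q[s, a] = r(s, a) + gamma * V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

policy = Q.argmax(axis=1)        # the one optimal action per state, forever
print("optimal policy:", policy)
```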
EDIT: another example is how this happens in games like Tom Ray’s Tierra or Core Wars or the Prisoners’ Dilemma tournaments here on LW: under any kind of resource constraint, the best agent is typically some extremely simple, fast replicator or attacker which can tear through enemies faster than they can react, neither knowing nor caring about exactly what flavor of enemy-of-the-week it is chewing up and digesting. Think Indiana Jones and the sword guy. (Analogies to infectious diseases and humans are left as an exercise for the reader...) Intelligence and flexibility are very expensive, and below a certain point they are pretty lousy tools which only just barely pay their way in a few ecological niches; a toy version of this is sketched after the footnote below. It requires intervention and design and slack to enable any kind of complex strategies to evolve. If someone shows you some DRL research like AI-GAs where agents rapidly evolve greater intelligence, this only works at all because the brains are ‘outside’ the simulation and thinking is free. If those little agents in, say, a DeepMind soccer simulation had to pay for all their thinking, they’d never get past a logistic regression in complexity. Similarly, one asteroid here or there, and an alien flying into the Solar System would conclude that viruses & parasites really are the ultimate and perfect life forms in terms of reproductive fitness in playing the game of life. (And beetles.)
* An example: the hottest game of the moment, a critical darling for its quality, made by a team that has shipped many highly-successful open-world 3D games before, is Elden Ring, designed to give even a master player hours of challenges. Nevertheless, you can beat it in <7 minutes by not much more than running through a few doors and twitching in place. (The twitching accelerates you at ultra-velocity ‘through’ the game, and when you launch & land just right, it kills the bosses, somehow. It will doubtless be improved over time.)
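(The toy sketch promised above: deliberately crude replicator dynamics with completely made-up numbers, just to show the “intelligence must pay its way” threshold. Below a certain resource level, the flat metabolic tax on the “thinker” lets the dumb replicator win.)

```python
# Crude toy, all numbers invented: replicator dynamics over two types.
# The "thinker" converts resources slightly better but pays a flat
# metabolic tax for its brain; under scarcity the cheap replicator wins.
import numpy as np

def evolve(resource, brain_tax=0.3, brain_bonus=0.2, steps=200):
    share = np.array([0.5, 0.5])  # [replicator, thinker] population shares
    for _ in range(steps):
        intake = resource * share
        fitness = np.array([
            intake[0],                                  # replicator: eats, copies
            intake[1] * (1 + brain_bonus) - brain_tax,  # thinker: smarter, costlier
        ]).clip(min=1e-9)
        share = fitness / fitness.sum()  # discrete replicator dynamics
    return share

for resource in (0.5, 1.0, 4.0):
    # Scarce worlds go to the replicator; only rich ones reward the brain.
    print(resource, evolve(resource).round(3))
```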
I disagree that they are all that interesting: a lot of TASes don’t look like “amazing skilled performance that brings you to tears to watch” but “the player stands in place twitching for 32.1 seconds and then teleports to the YOU WIN screen”.
I fully concede that a paperclip maximizer is way less interesting if there turns out to be some kind of false vacuum that allows you to just turn the universe into a densely tiled space filled with paperclips, expanding at the speed of light.
It would be cool to make a classification of games where perfect play is interesting (the Busy Beaver Game, Mao, Calvinball) vs. games where it is boring (Tic-Tac-Toe, Checkers). I suspect that since Go is merely EXPTIME-complete (not Turing-complete) it falls into the second category. But it’s possible that, e.g., optimal Go play involves a mixed-strategy Nash equilibrium drawing on an infinite set of strategies with ever-decreasing probability.
Problem left for the reader: prove the existence of a game which is not Turing-complete but where optimal play requires an infinite number of strategies, such that no computable algorithm outputs all of these strategies.
the idea of A and D being in an eternal stasis is improbable
I did cheat in the story by giving D a head start (so it could eternally outrun A by fleeing away at 0.99c). However, in general this depends on how common intelligent life is elsewhere in the universe. If the majority of A’s future light-cone is filled with non-paperclipping intelligent beings (and there is no false-vacuum or similar “hack”), then I think A has to remain intelligent.