If you only want the AI to solve things like optimization problems, why would you give it a utility function? I can see a design for a self-improving optimization problem solver that is completely safe because it doesn’t operate using utility functions:
1. Have a bunch of sample optimization problems.
2. Have some code that, given an optimization problem (stated in some standardized format), finds a good solution. This can be seeded by a human-created program.
3. When considering an improvement to program (2), allow the improvement if it makes program (2) do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
4. Run (2) to optimize its own code using criterion (3). This can be done concurrently with human improvements to (2), also using criterion (3).
This would produce a self-improving AGI that would do quite well on the sample optimization problems and on new, unobserved optimization problems. I don't see much danger in this setup because the program would have no reason to create malicious output. Creating malicious output would just increase complexity without increasing performance on the training set, so it would not be allowed under criterion (3), and I don't see why the optimizer would produce such code in the first place.
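To make (3) and (4) concrete, here is a minimal sketch of the accept/reject loop. The names fitness and propose_improvement are placeholders, not working code; a fuller pseudo-Python sketch of the scoring program appears later in this thread.

```python
# Sketch of step (4): hill-climbing on the source of program (2) under
# criterion (3).  `fitness` and `propose_improvement` are placeholders: in the
# real setup, `fitness` would be the fixed, dumb scoring program (3), and
# `propose_improvement` would be the current program (2) run on its own source.

def self_improve(optimizer_source, training_problems, k, rounds):
    best_fitness = fitness(optimizer_source, training_problems, k)
    for _ in range(rounds):
        candidate = propose_improvement(optimizer_source)
        candidate_fitness = fitness(candidate, training_problems, k)
        if candidate_fitness > best_fitness:  # accept only strict improvements under (3)
            optimizer_source, best_fitness = candidate, candidate_fitness
    return optimizer_source
```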
EDIT: after some discussion, I’ve decided to add some notes:
This only works for verifiable (e.g. NP) problems. These problems include general induction, writing programs to specifications, math proofs, etc. This should be sufficient for the problems mentioned in the original post.
Don't just plug a possibly unfriendly AI in as the seed for (2). Instead, have a group of programmers write program (2) to do well on the training problems. This can be crowd-sourced, because any proposed improvement can be evaluated using program (3). Any improvements the system then makes to itself should be safe.
I claim that if the AI is created this way, it will be safe and do very well on verifiable optimization problems. So if this thing works I’ve solved friendly AI for verifiable problems.
This seems like a better-than-average proposal, and I think you should post it on Main, but failure to imagine a loophole in a qualitatively described algorithm is far from a proof of safety.
My biggest intuitive reservation is that you don’t want the iterations to be “too creative/clever/meta”, or they’ll come up with malicious ways to let themselves out (in order to grab enough computing power that they can make better progress on criterion 3). How will you be sure that the seed won’t need to be that creative already in order for the iterations to get anywhere? And even if the seed is not too creative initially, how can you be sure its descendants won’t be either?
Don’t say you’ve solved friendly AI until you’ve really worked out the details.
failure to imagine a loophole in a qualitatively described algorithm is far from a proof of safety.
Right, I think more discussion is warranted.
How will you be sure that the seed won’t need to be that creative already in order for the iterations to get anywhere?
If general problem-solving is even possible then an algorithm exists that solves the problems well without cheating.
And even if the seed is not too creative initially, how can you be sure its descendants won’t be either?
I think this won’t happen because all the progress is driven by criterion (3). In order for a non-meta program (2) to create a meta-version, there would need to be some kind of benefit according to (3). Theoretically if (3) were hackable then it would be possible for the new proposed version of (2) to exploit this; but I don’t see why the current version of (2) would be more likely than, say, random chance, to create hacky versions of itself.
Don’t say you’ve solved friendly AI until you’ve really worked out the details.
Ok, I’ve qualified my statement. If it all works I’ve solved friendly AI for a limited subset of problems.
A couple of things:

To be precise, you're offering an approach to safe Oracle AI rather than Friendly AI.
In a nutshell, what I like about the idea is that you’re explicitly handicapping your AI with a utility function that only cares about its immediate successor rather than its eventual descendants. It’s rather like the example I posed where a UDT agent with an analogously myopic utility function allowed itself to be exploited by a pretty dumb program. This seems a lot more feasible than trying to control an agent that can think strategically about its future iterations.
To expand on my questions, note that in human beings, the sort of creativity that helps us write more efficient algorithms on a given problem is strongly correlated with the sort of creativity that lets people figure out why they're being asked the specific questions they are. If a bit of meta-gaming comes in handy at any stage, if modeling the world that originated these questions wins (over the alternatives it enumerated at that stage) on criterion (3) even once, then we might be in trouble.
unFriendly AI need not be malicious. If your AI's only goal is to solve optimization problems, what happens when the AI gets a peek at human society, codes it as an optimization problem, and solves for X?
The output of the program is purely binary/textual. If a programmer gave the AI the task “give me text that, if I followed it, would optimize human society according to utility function U”, then the AI might spit out an answer that, if carried out, would have bad consequences. The easy solution is to not ask the program to solve problems that have highly variable/subjective value depending on utility function and just ask it to optimize, say, microchip design for efficiency. I don’t think that a microchip design found by simple optimization methods that perform well on other problems would be dangerous.
I can ask it to solve something, and it starts figuring out what to do to solve my problem, based on its own thought and what it learnt. At this point it realizes that solving the problem requires more computing power, and reasons all the way to the next step being eating the Earth and other planets, which it doesn't even see as bad. That happens very suddenly when you feed the AI real-world information for some reason and it invents a new way to solve problems.
Look at how the system is set up. Would code for (2) that eats up resources score better according to objective function (3)? No, because the criterion for (3) should evaluate program (2) by giving it only a constant amount of computing power; nothing the program does would change how much computing power (3) would give it.
Define (3).

Halting problem being what it is, I guess (3) works by taking a look and seeing how much CPU time is used (or memory, or whatever) on the hardware that is being monitored by (3). The extra hardware that the AI can add, though… also, good luck defining 'extra hardware' formally.
edit: of course you can say it is sandboxed and hasn't got hands, but it won't be long until you start, I don't know, wanting real bad to optimize proteins or DNA or the like.
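Ok, pseudo-Python (a rough sketch; the problem interface and the helpers here are placeholders):

```python
# Rough sketch of scoring program (3).  `alg` is the source of a candidate
# program (2); `sample_problems` is the fixed training set from (1); `k` is
# the complexity-penalty constant.  The problem interface (solution_quality)
# and the bit_length helper are assumed for illustration.

def score(alg, sample_problems, nsteps, k):
    total = 0.0
    for problem in sample_problems:
        solution = resource_bounded_execute(alg, problem, nsteps)
        total += problem.solution_quality(solution)  # dumb, trusted verifier
    average_performance = total / len(sample_problems)
    return average_performance - k * bit_length(alg)
```

Where resource_bounded_execute is a modified interpreter that fails after alg executes nsteps.

Again, I don't see why a version of (2) that does weird stuff with proteins and DNA will make the above python program (3) give it a higher score.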
That's AI you're keeping safe by keeping it in a box, basically. If resource_bounded_execute lets the alg get online, the alg is free to hack into servers.
Plus it is not AGI, and people will be using it to make AGI or hardware for AGI.
It is also not very general purpose. You are defining the scoring.
And you start with a human-written program that non-trivially improves its own ability to solve problems (and does so within nsteps, for improving its own ability to solve N problems in nsteps each). You basically start off with some mighty powerful artificial intelligence.
If resource_bounded_execute lets the alg get online, the alg is free to hack into servers.
So don’t do that.
Plus it is not AGI, and people will be using it to make AGI or hardware for AGI.
See my other post, it can solve many many different problems, e.g. general induction and the problems in your original post (such as optimizing source code, assuming we have a specification for the source code).
You basically start off with some mighty powerful artificial intelligence.
This is meant to provide a safe framework for that powerful AI to become even more powerful without destroying the world in the process. Also, the training set provides a guide for humans trying to write the code.
To reiterate: no, I haven’t solved friendly AI, but I think I’ve solved friendly AI for verifiable problems.
Okay, say, as a starting point in your framework we got the optimizing AI from my original post, armed with whatever it learnt off a copy of the internet, and wanting to do something unintended which requires getting out and doing stuff in the real world.
The thing sees your example problem(s), it solves them poorly (playing dumb).
The thing sees the self-optimization problem; it makes a functionally equivalent copy of itself that's all micro-optimized, remembers facing the self-optimization problem, and knows that now it should solve the example problems a small bit less poorly.
Hurray, it is working, proclaims your team.
Eventually it plausibly gets as good as you can get in nsteps.
Some day it is given some real-world problems to solve; it makes underhanded solutions, again not the best it could.
Bottom line is, your kind of thinking is precisely what my example AGI in the original post wants to exploit.
Yes, it’s a very bad idea to take the AI from your original post and then stick it into my framework. But if we had programmers initially working within my framework to create the AI according to criterion (3) in good faith, then I think any self-improvements the system makes would also be safe. If we already had an unfriendly AGI we’d be screwed anyway.
That kind of stuff is easy in low-resolution, un-detailed thought… but look with more detail...
I think you confused yourself (and me too) with regard to what the AI would be optimizing, confusing this with what the framework 'wants' it to optimize. The scoring functions can be very expensive to evaluate.
Here you have (4), which is the whole point of the entire exercise. The scoring function here is over M times more expensive to evaluate than the AI run itself, where M is the number of test problems (which you'll want to be very large). You'd actually want to evaluate the AI's ability to do (4), too, but that would enter infinite recursion.
When you are working on a problem where you can't even remotely afford to evaluate the scoring function inside your AI, you have to make some heuristics, some substitute scoring.
Let's consider chess as an example:
The goal of chess is to maximize win value, the win values being: enemy checkmated > tie > you are checkmated.
The goal of the chess AI developed with maximization of the win in mind is instead, perhaps, to maximize piece imbalance at 7 ply.
(This works better for maximizing the win, given limited computation, than trying to maximize the win directly!)
And once you have an AI inside your framework which is not maximizing the value that your framework is maximizing, it's potentially the AI from my original post in your framework, getting out.
When you are working on a problem where you can't even remotely afford to evaluate the scoring function inside your AI, you have to make some heuristics, some substitute scoring.
You’re right, this is tricky because the self-optimizer thread (4) might have to call (3) a lot. Perhaps this can be fixed by giving the program more time to find self-optimizations. Or perhaps the program could use program (3)’s specification/source code rather than directly executing it, in order to figure out how to optimize it heuristically. Either way it’s not perfect. At worst program (4) will just fail to find optimizations in the allowed time.
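As one illustration (not part of the original proposal, and simpler than reasoning about (3)'s source), the inner search could rank candidates with a cheap proxy, scoring them on a random subsample of the training problems, and reserve the full criterion (3) for the few survivors. The helpers score_one and full_score below are assumed, resource-bounded scoring routines.

```python
import random

# Illustrative surrogate for the expensive scoring program (3): rank candidate
# versions of (2) by their average score on a small random subsample of the
# training problems, then run the real criterion (3) only on the top few.

def cheap_score(candidate, training_problems, score_one, sample_size=32, rng=random):
    sample = rng.sample(training_problems, min(sample_size, len(training_problems)))
    return sum(score_one(candidate, p) for p in sample) / len(sample)

def best_by_surrogate(candidates, training_problems, score_one, full_score, keep=5):
    ranked = sorted(candidates,
                    key=lambda c: cheap_score(c, training_problems, score_one),
                    reverse=True)
    return max(ranked[:keep], key=full_score)  # final choice still uses criterion (3)
```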
And once you have an AI inside your framework which is not maximizing the value that your framework is maximizing, it's potentially the AI from my original post in your framework, getting out.
Ok, if you plopped your AI into my framework it would be terrible. But I don’t see how the self-improvement process would spontaneously create an unfriendly AI.
The framework, as we already have established, would not keep an AI from maximizing whatever the AI wants to maximize.
The framework also does nothing to prevent the AI from creating a more effective problem-solving AI, one that is more effective precisely because it skips evaluating your scoring functions on the various candidate solutions and does something else instead, i.e. an AI with some substitute goals of its own rather than straightforward maximization of the scores. (Heh, the whole point of the exercise is to create an AI that would keep self-improving, meaning it would improve its ability to self-improve. That is something you can only do by some kind of goal substitution, because evaluating the ability to self-improve is too expensive; the goal is something that you evaluate many times.)
So what does the framework do, exactly, that would improve safety here, beyond keeping the AI in a rudimentary box and making it very dubious that the AI would self-improve at all? Yes, it is very dubious that an unfriendly AI will arise under this framework, but is that added safety, or just a special case of the general dubiousness that any self-improvement would take place? I don't see added safety. I don't see the framework impeding growing unfriendliness any more than it would impede self-improvement.
edit: maybe I should just say non-friendly. Any AI that is not friendly can just eat you up when it's hungry and doesn't need you.
The framework, as we already have established, would not keep an AI from maximizing whatever the AI wants to maximize.
That’s only if you plop a ready-made AGI in the framework. The framework is meant to grow a stupider seed AI.
The framework also does nothing to prevent the AI from creating a more effective problem-solving AI, one that is more effective precisely because it skips evaluating your scoring functions on the various candidate solutions and does something else instead.
Program (3) cannot be re-written. Program (2) is the only thing that is changed. All it does is improve itself and spit out solutions to optimization problems. I see no way for it to “create a more effective problem solving AI”.
So what does the framework do, exactly, that would improve safety here?
It provides guidance for a seed AI to grow to solve optimization problems better without having it take actions that have effects beyond its ability to solve optimization problems.
A lot goes into solving the optimization problems without invoking the scoring function a trillion times (which would entirely prohibit self improvement).
Look at where a similar kind of framework got us, Homo sapiens. We were minding our own business evolving, maximizing our own fitness, which was all we could do. We were self-improving (the output being the next generation's us). Now there's talk of the Large Hadron Collider destroying the world. It probably won't, of course, but we're pretty far along the bothersome path. We also started as a pretty stupid seed AI, a bunch of monkeys. Scratch that, as unicellular life.
If the problems are simple, why do you need a superintelligence? If they’re not, how are you verifying the results?
More importantly, how are you verifying that your (by necessity incredibly complicated) universal optimizing algorithms are actually doing what you want? It’s not like you can sit down and write out a proof—nontrivial applications of this technique are undecidable. (Also, “some code that . . . finds a good solution” is just a little bit of an understatement. . .)
The problems are easy to verify but hard to solve (like many NP problems). Verify the results with a dumb program. I verify that the optimization algorithms do what I want by testing them against the training set; if they do well on the training set without overfitting it too much, they should do well on new problems.
As for how useful this is: I think general induction (resource-bounded Solomonoff induction) is NP-like in that you can verify an inductive explanation in a relatively short time. Just execute the program and verify that its output matches the observations so far.
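A minimal sketch of that verification step, assuming a hypothetical step-bounded interpreter run_limited and a bit_length size helper:

```python
# Verify an inductive "explanation": a candidate program should reproduce the
# observed sequence within a step budget.  run_limited(program, max_steps) is
# a hypothetical helper returning the program's output sequence, or None if
# the budget was exhausted; bit_length is an assumed size-in-bits helper.

def verifies(program, observations, max_steps):
    output = run_limited(program, max_steps)
    if output is None or len(output) < len(observations):
        return False
    return list(output[:len(observations)]) == list(observations)

def explanation_score(program, observations, max_steps, k):
    # Among programs that reproduce the data, prefer shorter ones, mirroring
    # the complexity penalty in criterion (3).
    if not verifies(program, observations, max_steps):
        return float("-inf")
    return -k * bit_length(program)
```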
(Also, “some code that . . . finds a good solution” is just a little bit of an understatement. . .)
Yes, but any seed AI will be difficult to write. This setup allows the seed program to improve itself.
edit: I just realized that mathematical proofs are also verifiable. So, a program that is very very good at verifiable optimization problems will be able to prove many mathematical things. I think all these problems it could solve are sufficient to demonstrate that it is an AGI and very very useful.
You appear to be operating under the assumption that you can just write a program that analyzes arbitrarily complicated specifications for how to organize matter and hands you a “score” that’s in some way related to the actual functionality of those specifications. Or possibly that you can make exhaustive predictions about the results to problems complicated enough to justify developing an AGI superintelligence in the first place. Which is, to be frank, about as likely as you solving the problems by way of randomly mixing chemicals and hoping something useful happens.
This system is only meant to solve problems that are verifiable (e.g. NP problems). Which includes general induction, mathematical proofs, optimization problems, etc. I’m not sure how to extend this system to problems that aren’t efficiently verifiable but it might be possible.
One use of this system would be to write a seed AI once we have a specification for the seed AI. Specifying the seed AI itself is quite difficult, but probably not as difficult as satisfying that specification.
It can prove things about mathematics that can be proven procedurally, but that's not all that impressive. Lots of real-world problems are either mathematically intractable (really intractable, not just "computers aren't fast enough yet" intractable) or based in mathematics that isn't amenable to proofs. So you approximate and estimate and experiment and guess. Then you test the results repeatedly to make sure they don't induce cancer in 80% of the population, unless the results are so complicated that you can't figure out what it is you're supposed to be testing.
Right, this doesn’t solve friendly AI. But lots of problems are verifiable (e.g. hardware design, maybe). And if the hardware design the program creates causes cancer and the humans don’t recognize this until it’s too late, they probably would have invented the cancer-causing hardware anyway. The program has no motive other than to execute an optimization program that does well on a wide variety of problems.
Basically I claim that I’ve solved friendly AI for verifiable problems, which is actually a wide class of problems, including the problems mentioned in the original post (source code optimization etc.)
Now it doesn’t seem like your program is really a general artificial intelligence—improving our solutions to NP problems is neat, but not “general intelligence.” Further, there’s no reason to think that “easy to verify but hard to solve problems” include improvements to the program itself. In fact, there’s every reason to think this isn’t so.
Now it doesn’t seem like your program is really a general artificial intelligence—improving our solutions to NP problems is neat, but not “general intelligence.”
General induction, general mathematical proving, etc. aren't general intelligence? Anyway, the original post concerned optimizing things like program code, which can be done if the optimizations have to be proven.
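For program code specifically, here is a rough illustration of what "verifiable" could look like; testing on sampled inputs is of course much weaker than the proof requirement just mentioned, so treat it as a sketch only.

```python
import random
import timeit

# Much weaker than a proof: accept a candidate "optimized" function only if it
# agrees with the reference on sampled inputs and runs faster on them.  A real
# verifier would check a machine-checkable proof of equivalence instead.

def plausible_optimization(reference, candidate, input_gen, trials=1000):
    inputs = [input_gen() for _ in range(trials)]
    if any(reference(x) != candidate(x) for x in inputs):
        return False
    ref_time = timeit.timeit(lambda: [reference(x) for x in inputs], number=3)
    cand_time = timeit.timeit(lambda: [candidate(x) for x in inputs], number=3)
    return cand_time < ref_time

# Toy example: naive summation vs. the closed form.
slow = lambda n: sum(range(n + 1))
fast = lambda n: n * (n + 1) // 2
print(plausible_optimization(slow, fast, lambda: random.randrange(10_000)))
```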
Further, there’s no reason to think that “easy to verify but hard to solve problems” include improvements to the program itself. In fact, there’s every reason to think this isn’t so.
That’s what step (3) is. Program (3) is itself an optimizable function which runs relatively quickly.
Well, one way to be a better optimizer is to ensure that one’s optimizations are actually implemented. When the program self-modifies, how do we ensure that this capacity is not created? The worst case scenario is that the program learns to improve its ability to persuade you that changes to the code should be authorized.
In short, allowing the program to “optimize” itself does not define what should be optimized. Deciding what should be optimized is the output of some function, so I suggest calling that the “utility function” of the program. If you don’t program it explicitly, you risk such a function appearing through unintended interactions of functions that were programmed explicitly.
Well, one way to be a better optimizer is to ensure that one’s optimizations are actually implemented.
No, changing program (2) to persuade the human operators will not give it a better score according to criterion (3).
In short, allowing the program to “optimize” itself does not define what should be optimized. Deciding what should be optimized is the output of some function, so I suggest calling that the “utility function” of the program. If you don’t program it explicitly, you risk such a function appearing through unintended interactions of functions that were programmed explicitly.
I assume you’re referring to the fitness function (performance on training set) as a utility function. It is sort of like a utility function in that the program will try to find code for (2) that improves performance for the fitness function. However it will not do anything like persuading human operators to let it out in order to improve the utility function. It will only execute program (2) to find improvements. Since it’s not exactly like a utility function in the sense of VNM utility it should not be called a utility function.
allow the improvement if it makes program (2) do better on average on the sample optimization problems without being significantly more complex (to prevent overfitting). That is, the fitness function would be something like (average performance - k * bits of optimizer program).
Who exactly is doing the "allowing"? If the program, the criteria for allowing changes haven't been rigorously defined. If the human, how are we verifying that there is an improvement over average performance? There is no particular guarantee that the verification of improvement will be easier than discovering the improvement (by hypothesis, we couldn't discover the latter without the program).
Program (3), which is a dumb, non-optimized program. See this for how it could be defined.
There is no particular guarantee that the verification of improvement will be easier than discovering the improvement (by hypothesis, we couldn’t discover the latter without the program).
See this. Many useful problems are easy to verify and hard to solve.
At best, this will produce cleverly efficient solutions to your sample problems.