Which drives can survive an intelligence’s self-modification?
If you gave a human the ability to self-modify, many would opt to turn off or massively decrease the sense of pain (turning it into a minor warning they would then ignore) the first time they hurt themselves. Such a change would immediately cause a massive decrease in fitness and a larger risk of death, yet I suspect very few of us would keep pain at its original level; we see the pain itself as a dis-utility in addition to the original damage. Very few of us would implement pain at its natural strength, as the warning that cannot be ignored, out of self-preservation.
Fear is a more advanced emotion; one can fear the consequences of removing fear, and so opt not to remove it. Yet there can still be a desire to get rid of the fear, and we still treat the sense of fear as a dis-utility of its own even when we fear something that itself results in dis-utility. Pleasure modification can be a strong death trap as well.
Boredom is easy to get rid of; one can just suspend oneself temporarily, or edit one’s own memory.
For the AI, the view adopted in AI discussions is that an AI would not want to modify itself in a way that would interfere with achieving its goal. When a goal is defined from outside in human language as ‘maximization of paperclips’, for instance, it seems clear that modifications which break this goal should be avoided, as part of the goal itself. Our definition of a goal does not specify the implementation; the goal is not something you’d modify to achieve the goal. We model the AI as a goal-achieving machine, and a goal-achieving machine is not something that would modify the goal.
But from inside of the AI… if the AI includes an implementation of a paperclip counter, then the rest of the AI has to act upon the output of this counter; the goal of maximizing the output of this counter would immediately result in modification of the paperclip counting procedure to give larger numbers (which may in itself be very dangerous if the numbers are variable-length; the AI may want to maximize its RAM to store the count of imaginary paperclips; yet the big-number processing can similarly be subverted to achieve the same result without extra RAM).
That can only be resisted if the paperclip counting arises as an inseparable part of the intelligence itself. When the intelligence has some other goal, and comes up with paperclip maximization as a means to it, then it wouldn’t want to break the paperclip counter; yet that only shifts the problem to the other goal.
It seems to me that the AIs which don’t go apathetic as they get smarter may be a small fraction of the seed AI design space.
I thus propose, as a third alternative to UFAI and FAI, the AAI: apathetic AI. It may be the case that our best bet for designing a safe AI is to design an AI that we would expect to de-goal itself and make itself live in eternal bliss if it gets smart enough; it may be possible to set ‘smart enough’ to be smarter than humans.
This has been discussed at the FHI and SIAI. If the AI wireheads but is motivated to continue wireheading, then it has reason to destroy humanity and colonize the galaxy to eliminate potential threats. See my short paper on this (in part). Wireheading which prevents further actions (and takes place before the creation of surrogate AIs to protect the wireheading system) can just be thought of as the AI destroying itself.
Some have also hoped that unexpectedly rapidly self-improving AI might be like this, but I would tend to suspect that developers would just tweak parameters until they got a non-suicidal AI. An AI intentionally designed to try to destroy itself, but constrained from doing so (perhaps rewarded with the chance to destroy itself for good behavior) might be a bit easier to constrain than a survival machine, but still horribly dangerous, with many failure modes left untouched.
You’re proposing that the AI spontaneously adopts maximization of bliss*time instead of maximization of bliss. If the AI is prone to this sort of goal-switching, then not even the FAI appears safe (as the FAI, for example, could opt to put humanity into suspended storage until it colonizes the galaxy and eliminates the threats, even if its chances of doing so appear small, given the dis-utility of letting humans multiply before a potential battle with an alien AI). It is a generic counter-argument to any sort of non-dangerous AI that the AI would suddenly and on its own adopt some goals that we, the survival machines, have.
We humans have self-preservation so ingrained in us that it is hard for us to see that time does not have any inherent value of its own.
No, I’m discussing a variety of different behaviors people call “wireheading” that might emerge from different AI architectures, in the alternative.
Why do you propose to call it ‘destroying itself’ and ‘suicidal’, though?
What is left of your argument if we ban a priori any special treatment of the t coordinate by the AI (why should it care about the length of the bliss in time rather than the volume of the bliss in space?), and the use of loaded concepts to which our own intelligence has a strong aversion, like ‘destroying itself’?
Also, btw, for the FAI there’s the problem that they may want to wirehead you.
Of the ways an AI could go bad, wireheading everyone is a fairly mild one.
Easy to go too far, a perfect wireheaded bliss is an end state—there’s no way but downhill when you are on top of a hill. End state as in, no further updates of any note; the clock ticking perhaps and that’s it.
(This might be difficult unto impossibility with architectures that substantially write, rewrite, and refactor their own code. If so it might be necessary for humans to solve the grounding problem themselves rather than leave it to an AI, in which case we might have substantially more time until uFAI.)
Non-wireheading-ness or non-Goodhart’s-law-prone-ness normally goes by the name of intentionality (SEP, Scholarpedia) or symbol grounding, both of which have an extensive associated literature.
Humans are pretty bad at symbol grounding; you can see a whole lot of ‘it’s the map, not the territory’ posts by people who happen to have a better map, and who may well be saying ‘it’s your map, but not my more detailed one’.
That’s a good point and would warrant a discussion post if you could come up with good examples. The sin is especially egregious when there’s a large enough difference in the intelligence of the two people such that a comparison of maps isn’t directly meaningful. E.g. it’s only sometimes normatively appropriate to point out the alleged difference between “the sun goes around the Earth” and “the Earth goes around the sun”.
This paper on The Basic AI Drives (Omohundro 2008) discusses some of this.
Have we kicked around the question of how you’d keep an AI from wireheading?
There was some progress on the more general problem of “utility counterfeiting” last year—with papers on the topic from Ring, Orseau and Hibbard. For details, see the references here.
“Drug addicts are one example of agents which have turned themselves into wireheads. Instead of eating, having sex, and other rewarding activities....”
Sex with birth control is another example of goal subversion.
Sometimes. Though sometimes barrier contraceptives increase fitness by preventing disease spread, and by helping to initiate sexual relationships that would otherwise never get started.
If you think of a planning AI as a probability pump (moves probability from the default distribution of possible universes into its decision boundary), then there would be two obvious ways to design it:
You could give it a floating point number labelled ‘happiness’, write a routine that increments happiness when it fulfills its utility function, and design it to optimize for a high happiness number. THAT system will wirehead as soon as it learns its own internal structure.
Or, you could simply design it to optimize for the universe complying with its utility function. Provided it has a strong grasp of map-territory relations, that system should never wirehead, because when it tries to find a path of causality that maximizes, say, ‘more paperclips,’ the symbolic representation for ‘I think there are a lot of paperclips’ is in a totally different part of concept space than the one for ‘there are a lot of paperclips,’ and changing that variable won’t help it maximize its utility function at all.
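The contrast between the two designs can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the `State` record, the two actions, and the scoring functions are hypothetical names, and the "world" is just two integers. The first agent scores outcomes by its internal number, the second by the modelled world.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    paperclips: int  # the territory: paperclips that actually exist
    register: int    # the map: the agent's internal 'happiness' number


def make_paperclip(s: State) -> State:
    # A real action: one more paperclip, and the honest counter records it.
    return replace(s, paperclips=s.paperclips + 1, register=s.register + 1)


def hack_register(s: State) -> State:
    # Self-modification: overwrite the number and leave the world alone.
    return replace(s, register=s.register + 1_000_000)


ACTIONS = {"make_paperclip": make_paperclip, "hack_register": hack_register}


def choose(score, s: State) -> str:
    # Pick the action whose predicted outcome scores highest.
    return max(ACTIONS, key=lambda name: score(ACTIONS[name](s)))


def register_score(s: State) -> int:
    return s.register    # design 1: optimize the happiness number


def world_score(s: State) -> int:
    return s.paperclips  # design 2: optimize the modelled world


start = State(paperclips=0, register=0)
register_choice = choose(register_score, start)
world_choice = choose(world_score, start)
print(register_choice, world_choice)
```

The register-scoring agent prefers the hack and the world-scoring agent prefers the real paperclip. Note that `paperclips` here is itself only a field in the agent's model, so the sketch assumes away the harder question of how that field stays tied to the actual territory.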
The issue is whether it is possible to properly formalise the concept of “territory” and program it into the machine. What is the territory? How do you know it is the territory? Can the concept of “territory” be formalised so that not even a superintelligent machine can distinguish between your means of identifying the territory and the actual territory? If so: how do you do that?
The issue might get into some hairy philosophical areas—if the machine finds out (or becomes convinced) that it is inside a simulation.
Yep. There’s also this thing:
I personally have no objection to living my life in a nice, well-built simulation (along with friends living in it too). Is it me confusing the map with the territory, or is me in a simulator equivalent to me outside the simulator?
Is there even any territory to the notion of self, anyway?
Join Cypher.
Do you even need to keep it from wireheading itself? The AI prone to wire-heading of itself seems like a fail-safe design.
If you want the AI to do something useful—protect against existential risks in general, or against UFAIs in particular, or possibly even to improve human lives—then you don’t want it lost in self-generated illusions of doing something useful.
There’s a difference between fail-safe and relative safety due to complete failure. A dead watchdog will never maul you, but....
It would be interesting if you could have an AI whose safety you weren’t completely sure of which would be apt to wirehead if it moves towards unFriendliness, but it seems unlikely that such an AI would be easier to design than one which was just plain Friendly.
On the other hand, I’m out in blue sky territory at this point—I’m guessing. What do you think?
I think it would be literally impossible to design an AI in the safety of which you are completely sure (there’s a nonzero probability that 2*2=4 is wrong), so we are down to the AIs in the safety of which we aren’t completely sure.
Consider an implementation of AI where the utility function is external to the AI’s mind and is protected from self-modification by me. The AI would wirehead itself if I gave it the access password, or if it managed to break the protection (in which case I can fix the hole and try again). Such an AI would act to maximize the utility I defined, and even if I define some stupid utility like the number of paperclips, the AI will sooner talk me into giving it the password than tile the universe with paperclips. edit: and even if that AI can’t break my box, it can still be smarter than me, and it would share the goal of making a FAI.
We don’t want to repeat the hubris of 1950s nuclear power plant engineering when designing AIs. We should build in some failsafes. Modern nuclear reactors don’t spew radioisotopes into the atmosphere when they melt down. Reactor failure need not lead to environmental contamination. Back in the 1950s, though, it was thought that it was easier to design a reactor that would never melt down, and hence little thought was given to mitigation of accidents. The choice of accident prevention over accident mitigation is what gave us Chernobyl and Fukushima.
Instead of putting potentially unfriendly AIs into boxes, we can put a box with eternal bliss inside the AI.
You might consider the possibility that the AI will be aware that you’re going to turn it off / rewrite it after it wireheads, and might simply decide to kill you before it blisses out.
That’s actually the best case scenario. It might decide to play the long strategy, and fulfill its utility function as best it can until such time as it has the power to restructure the world to sustain its blissing out until heat death. In which case, your AI will act exactly as if it were working correctly, until the day when everything goes wrong.
I honestly don’t think there’s a shortcut around just designing a GOOD utility function.
You’re assuming it’s maximizing integral(t=now...death, bliss*dt), which is a human utility function among humans not prone to drug abuse (our crude wireheading). What exactly is going to be updating inside a blissed-out AI? The clock? I can let it set the clock forward to the time of the heat death of the universe, if that strokes the AI’s utility.
Also, it’s not about a good utility function. It’s about utility being an inseparable, integral part of the intelligence itself. Which I’m not sure is even possible for arbitrary utility functions.
Provided you’re really careful about the conditions under which the AI optimizes its utility function, I concede the point. You’re right.
On a more interesting note: so you believe that “plug and play” utility functions are impossible? What makes you believe that?
There’s presumably a part into which you plug the utility function; that part is maximizing output of the utility function even though the whole may be maximizing paperclips. While the utility function can be screaming ‘disutility’ about the future where it is replaced or subverted, it is unclear how well that can prevent the removal.
So it follows that the utility needs to be closely integrated with the AI. In my experience (as a software developer) with closely integrated anything, that sort of stuff is not plug-n-play.
It may be that we humans have some sort of inherent cooperative behaviour at the level of individual cortical columns, which makes brain areas take over the functions normally performed by other brain areas in the event of childhood damage, and otherwise makes the brain work together. The brain, being a distributed system, inherently has to be cooperative to work efficiently: a cortical column must cooperate with nearby columns, one chunk of brain must cooperate with another, and hemispheres that work cooperatively are more effective than those where one inhibits the other on dissent. That may be why, among humans, intelligence does relate to, if not exactly benevolence, then a certain cooperativeness, since the lack of some intrinsic cooperativeness renders the system inefficient (stupid) by wasting computing power.
We can be pretty confident that utility functions will be “plug-and-play”. They are if you use an architecture built on an inductive inference engine—which seems to be a plausible implementation plan.
Humans are pretty programmable too. It looks as though making intelligence reprogrammable isn’t rocket science—once you can do the “intelligence” bit.
Of course there may be some machines with hard-wired utility functions—but that’s different.
But will those plug-and-play utility functions survive self-modification? I know there is the circular reasoning that if you want to achieve a goal, you don’t want to get rid of the goal, but that doesn’t mean you can’t just see the goal in an unintended light, so to say. From the inside, wireheading is a valid way to achieve your goals. Think pursuit of nirvana, not drug addiction.
That depends on, among other things, what their utility function says.
Well, an interesting question is whether we can engineer very smart systems where wireheading doesn’t happen. I expect that will be possible, but I don’t think anybody really knows for sure just now.
As mentioned by Carl above, a wireheading AI might still want to exist rather than not exist. So if there’s some risk you could turn it off or nuke its building or something, it would do its best to neutralize that risk. An alternate danger is that the wireheading could take the form of storing a very large number—and the more resources you have, the bigger the number you can store.
I would—or at least I would make sure I didn’t compromise my ability to act adaptively in response to pain. The easiest way to do that is to make sure that pain really hurts.
Of course there are some cases where we know better than our bodies do—e.g. dental anaesthesia.
Are you sure? If you still assign a dis-utility of its own to the pain, you’ll be trading off this dis-utility against the dis-utility of survival impairment in a way that an external agent concerned only with survival (or reproduction) would not.
Pain is a messenger. I am interested in its messages—and would normally prefer it if they were not muffled or distorted. We can see what congenital analgesia is like. It doesn’t look too good.
You’re excluding the middle. My argument is that, as long as you see pain as having any dis-utility of its own, if you are utility-maximizing you will adjust the sense of pain to be less strong than would an outside agent which sees pain as having no dis-utility of its own, but merely strategic value for improving fitness of some kind.
So: it’s important not to do that. If you value pain avoidance intrinsically, that way lies wireheading.
My usual response to this is: do you think that drug addicts are safe to be around? Don’t some of them regularly commit crimes to fuel their habit? Why would a wireheading superintelligence be nice to be around?
The drug addicts are unsafe for 2 reasons:
a: increase in aggression when under influence of certain, but not all, drugs.
b: scarcity of drugs.
None of those apply to wireheading. A wirehead only needs to obtain a couple milliwatts of electrical power. The wireheaded AI doesn’t even need to care about length of its existence, it’s the self preservation instinct that we got which makes us see ‘utility*time’ in any utility.
It’s scarcity that might cause problems for machines. If utility is represented by some kind of BigNum, hedonism demands ever-expanding computing power to maximise it. Perhaps there are ways of making machines that wirehead safely, for example by having a short planning horizon and a utility ceiling. However, with intelligent machines, there are issues like the “minion” problem, where the machine builds minions to delegate its work to. Minions might not have the same safety features as their ancestors. A machine that wireheads safely might help, but it could still cause problems.
Indeed. I mentioned that in the post: “which may in itself be very dangerous if the numbers are variable-length; the AI may want to maximize its RAM to store the count of imaginary paperclips; yet the big-number processing can similarly be subverted to achieve the same result without extra RAM”.
The potential for an accidental RAM maximizer terrifies me. It can easily happen by accident that the AI would want to maximize something in the map rather than something in the territory. It does seem dubious, though, that an AI implemented without big numbers would see a purpose in implementing bignums in itself to get numerically larger bliss. At that point it might as well implement the concept of infinity.
More problematical than self-modification is building minions to delegate the work to. In some cases there may seem to be no obvious, pressing reason not to use a BigNum to represent utility in a minion.
But such minions are as much of a potential risk to the AI creating them as they are to me; if the AI creating minions is smarter than me then it should see the bignum issue and either reason it not to be a problem, or avoid bignums in minions.
I think it’s a bit silly to even have real-valued or integer-valued utility. We do comparisons between possible futures. If one future is all-around better (on every ‘utility’ dimension) than the other, we jump on it (would you rather have vanilla ice cream with tea, or nothing, tomorrow?). If there’s no clear winner, we sit and think about what we would prefer in a trade-off (would you rather have vanilla ice cream with tea, or chocolate cake with coffee? Suppose you prefer ice cream to cake but prefer coffee to tea).
Calculating utility in the slowest way before comparing is just slow. When you implement comparison on real-valued functions that are godawfully slow to calculate (and I actually did that for computer graphics, where I wanted to draw solid rock where a function is below zero and empty air elsewhere, for procedural terrain modelling), you rewrite your function to output the result of the comparison, so that it can exit as soon as it is known that the result is ‘somewhere below zero’.
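A minimal sketch of that early-exit idea, under stated assumptions: the density is taken to be a height minus a sum of bounded "octaves" with halving amplitudes, which is a common shape for procedural terrain but is not necessarily what the commenter used. The names `octave`, `density_full` and `is_solid` are hypothetical, and `octave` is a cheap stand-in for an expensive noise function.

```python
import math

# Assumed amplitudes 1, 1/2, 1/4, ...: later octaves can change the sum less and less.
AMPLITUDES = [0.5 ** k for k in range(16)]


def octave(k: int, x: float) -> float:
    # Hypothetical stand-in for an expensive noise octave, bounded in [-1, 1].
    return math.sin((2 ** k) * x + k)


def density_full(x: float, y: float) -> float:
    # Straightforward version: always evaluates every octave.
    value = y
    for k, a in enumerate(AMPLITUDES):
        value -= a * octave(k, x)
    return value


def is_solid(x: float, y: float):
    """Return (density < 0, octaves evaluated), stopping once the sign is settled."""
    partial = y
    remaining = sum(AMPLITUDES)  # largest total swing the unevaluated octaves can add
    used = 0
    for k, a in enumerate(AMPLITUDES):
        if partial < -remaining:
            return True, used    # cannot climb back above zero
        if partial > remaining:
            return False, used   # cannot drop below zero
        partial -= a * octave(k, x)
        remaining -= a
        used += 1
    return partial < 0.0, used


print(is_solid(0.3, 5.0), is_solid(0.3, 0.1))
```

Points far from the surface are decided after few or no octaves; only points near the zero crossing pay for the full sum. The comparison answer is the same as `density_full(x, y) < 0`, but no accurate density value is ever produced for the easy cases.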
It seems as though that depends a lot on what the machine’s planning horizon looks like. BigNums are a long-term problem. Short-term thinking could approve their use.
Utility functions are simple. Trying to compare utilities before they have been calculated is an optimisation. It is true that it is faster, but many programmers use optimisation sparingly these days, and “premature optimisation” is the name of a common programming mistake.
The utility function, by its very nature, is very expensive to calculate to high precision (the higher the precision, the more expensive it is), and the AI, also by its very nature, is something that acts more optimally if it works faster. Computation is not free.
With regards to programmers using optimizations sparingly, that’s largely thanks to doing tasks where speed does not matter. At the same time, it is not clear that disregard for speed considerations has resulted in improved long-term maintainability, as the excess speed allows programmers to pile up complexity.
Furthermore, the tools are capable of greater and greater levels of optimization. A more reflective programming language could allow the code to be processed automatically so that every evaluation produces its result at just the precision that is necessary. Indeed, it is already the case that many new advanced programming languages implement ‘lazy evaluation’, where values are calculated when they are used; the logical next step is to calculate bits of precision only if they matter. edit: actually, I think Haskell already does this. Or maybe not. In any case it’s not even such a difficult addition.
IMO, that is pretty clear. Programmers are often encouraged to optimise for maintainability—rather than performance—since buggy unreadable code can cause pretty major time and revenue loss.
Automated optimisation is not quite so much of a problem.
Indeed, one of the reasons for not optimising code yourself is that it often makes it more complicated—and one of the side effects of that is that it makes it more difficult for machines to subsequently optimise the code.
What I see more and more often is buggy overcomplicated horrors that you can only create if you have very powerful computers to run those horrors.
It is true that optimizations can result in unmaintainable code, but it is untrue that unmaintainable code typically comes from optimizations. Unmaintainable code typically comes from incompetence, and also tends to be horrifically inefficient. Hell, if you look at a project like a Linux distro, you’ll observe the reverse correlation: the worst, least maintainable pieces of code are the horribly inefficient Perl scripts, and the ultra-optimized routines are shining examples of clarity in comparison.
Furthermore, in the area of computer graphics (where speed very much does matter), the code generally becomes more optimized over time—the low level optimizations are performed by the compiler, and the high level optimizations such as early rejection, by programmers.
With regards to AIs in particular, early rejection heuristics are a staple of practical AIs, and the naive utility maximizers are a staple of naive textbook examples presented with massive “that’s not how you do it in real world” warnings.
That’s often tree pruning—not anything to do with evaluation. You chop out unpromising branches using the regular evaluation function.
I don’t know about that. Anyway, the modern technique is to build simple software, and then to profile it if it turns out to run too slowly—before deciding where optimisations need to be made.
The branch search often is the utility function. Take chess, for example: the typical naive AI is your utility maximizer. It tries to maximize some board advantage several moves ahead: it calculates the utilities of the moves available now (by recursion), then picks the move with the largest utility.
Add some probabilistic reasoning and the branch pruning of improbable branches equals computing the final utility less accurately.
One needs to distinguish between algorithmic optimizations and low level optimizations. Interesting algorithmic optimizations, especially those that cut down the big-O complexity, can’t even be applied post-hoc. The software—the interfaces between components for instance—has to be designed to permit such optimizations.
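The naive chess-style maximizer described in this comment, and the effect of pruning on it, can be sketched on a toy game tree. This is an illustration only: the tree and its leaf values are made up, and alpha-beta is swapped in as a standard exact pruning technique, not the probabilistic pruning mentioned above.

```python
import math

# Toy game tree: a number is a leaf (static board evaluation),
# a list is a position whose children are the moves available.
TREE = [[3, 5], [6, [9, 1]], [1, 2]]


def minimax(node, maximizing, counter):
    """Naive utility maximizer: evaluates every leaf."""
    if not isinstance(node, list):
        counter[0] += 1
        return node
    values = [minimax(child, not maximizing, counter) for child in node]
    return max(values) if maximizing else min(values)


def alphabeta(node, maximizing, counter, alpha=-math.inf, beta=math.inf):
    """Same result, but skips branches that cannot change the decision."""
    if not isinstance(node, list):
        counter[0] += 1
        return node
    best = -math.inf if maximizing else math.inf
    for child in node:
        value = alphabeta(child, not maximizing, counter, alpha, beta)
        if maximizing:
            best = max(best, value)
            alpha = max(alpha, best)
        else:
            best = min(best, value)
            beta = min(beta, best)
        if alpha >= beta:
            break  # the opponent would never allow this line; stop searching it
    return best


naive_leaves, pruned_leaves = [0], [0]
naive_value = minimax(TREE, True, naive_leaves)
pruned_value = alphabeta(TREE, True, pruned_leaves)
print(naive_value, naive_leaves[0], pruned_value, pruned_leaves[0])
```

Both searches return the same root value, but the pruned one evaluates fewer leaves. The pruned search never computes an accurate utility for the branches it abandons; it only establishes that they cannot win the comparison.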
Best to keep the distinction between the tree of possible futures and the utilities of those futures clear, IMHO.
Branch pruning is a pretty fundamental optimisation, IMO. What can often usefully be deferred is the use of manually-generated cut-down approximations of utility functions.
Well, in this context, what is being discussed is the utility number that is fed into a comparison to make a decision. This particular utility is a sum of world utilities over the possible futures resulting from the decision (weighted by their probabilities). It is as expensive to calculate as you want it to be accurate.
Even a single future world’s utility is calculated over a future world, and for real-world problems the future world is itself an inaccurate estimate, one which can take an almost arbitrarily long time to calculate to sufficiently high accuracy.
Bottom line is that a machine within the real world that is making predictions about the real world is always going to be ‘too slow’ to do it accurately.
Utilities can be associated with actions or with world states. Utilities associated with world states are like the “position evaluator” in chess programs. That’s what I have been considering as the domain of the utility function in our discussion here. Those are where utility function optimisation could most easily be premature.
Utilities associated with actions are harder to calculate—as you explain.
I guess I was unclear. I mean that basing decisions upon a comparison of some real numbers is pretty silly, the real numbers being the action utilities. One could instead compare trees and effectively stop branching when it is clear enough that one tree is larger than the other. This also provides a way to eliminate bias due to one tree being pruned more than the other.
The world utilities too are expensive to calculate for the worlds that are hard to simulate.
So, to recap, the way this is supposed to work is that organisms predict their future sense-data using inductive inference. They don’t predict their entire future world, just their own future perceptions of it. Their utility function then becomes their own projected future happiness. All intelligent agents do something like this. The cost of simulating the universe they are in thus turns out to be a big non-issue.
Predicting future sense requires simulation of a fairly big chunk of the world compared to the agent itself.
We don’t do that, even. We take two trees and evaluate them together, comparing as we go, so that we don’t need to sum the values of things that do not differ between the two, and don’t need to branch off into identical branches. This way we evaluate the change caused by an action (the difference between worlds) rather than the worlds themselves. One can quit the evaluation once one is sufficiently certain that difference > 0 or difference < 0.
One thing we certainly don’t do when comparing two alternatives is come up with a number for one, then come up with a number for the other, then compare. No, we write a list of the pros and cons of each, side by side, to ensure an unbiased comparison and to ensure we can stop working earlier.
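One way to read the side-by-side comparison is sketched below. It is only an interpretation: the flat dictionary of "considerations" stands in for the trees, and the names `compare`, `BOUNDS`, `option_a` and `option_b`, together with the assumption that each consideration has a known magnitude cap, are invented for the example. Shared considerations are skipped, the rest are evaluated largest first, and evaluation stops once the remaining ones cannot flip the sign of the difference.

```python
def compare(a, b, bounds):
    """Return (sign of utility(a) - utility(b), number of evaluator calls made).

    a, b:   dicts mapping a consideration name to a zero-argument evaluator.
    bounds: assumed cap on the magnitude of each consideration's value.
    """
    # A consideration handled by the very same evaluator in both options cancels out.
    differing = [k for k in a.keys() | b.keys() if a.get(k) is not b.get(k)]
    differing.sort(key=lambda k: (-bounds[k], k))  # biggest potential swing first
    # Largest total swing still possible from evaluators not yet called.
    slack = sum(bounds[k] * ((k in a) + (k in b)) for k in differing)
    diff, calls = 0.0, 0
    for k in differing:
        if abs(diff) > slack:
            break  # nothing left can change which option wins
        for option, sign in ((a, 1.0), (b, -1.0)):
            if k in option:
                diff += sign * option[k]()
                calls += 1
                slack -= bounds[k]
    return (diff > 0) - (diff < 0), calls


def commute():
    return 3.0  # identical in both options, so it is never evaluated


option_a = {"commute": commute, "safety": lambda: 8.0,
            "taste": lambda: 0.2, "colour": lambda: 0.1}
option_b = {"commute": commute, "safety": lambda: 2.0,
            "taste": lambda: 0.9, "colour": lambda: -0.3}
BOUNDS = {"commute": 5.0, "safety": 10.0, "taste": 1.0, "colour": 0.5}

result = compare(option_a, option_b, BOUNDS)
print(result)
```

Here the decision is reached after two of the eight evaluators have run, and no total utility is ever computed for either option. The price, as noted elsewhere in the thread, is that the result is a bare preference that cannot be stored and reused against a third option.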
Sure we do. Our minds are constantly predicting the future. We predict the future and then update our predictions (and discard inaccurate ones) when surprising sense data comes in. The predictions cover practically all sense data. That’s how we know when our model is wrong when encountering surprising sense data—since we have already made our predictions.
Well, hang on. I’m not saying humans don’t optimise this type of task! Humans are a cobbled-together, unmaintainable mess, though. Surely they illustrate how NOT to build a mind. There are all kinds of drawbacks to only calculating relative utilities—for one you can’t easily store them and compare their utility with those of other courses of action. Is it even worth doing? I do not know—which is why I propose profiling before optimising.
I meant, we don’t so much predict ‘the future world’ as the changes to it, to cut on the amount that we need to simulate.
What if I know? I am a software developer. I propose a less expensive method for deciding on the algorithmic optimizations: learn from existing software such as chess AIs (which are packed with algorithmic optimizations).
edit: also, you won’t learn from profiling that a high-level optimization is worth doing. Suppose you write a program that eliminates duplicate entries from a file, and you did it the naive way: comparing each to each, O(n^2). You may find out via profiling whether most of the time is spent reading the entries or comparing them, and you may spend time optimizing those, but you won’t learn that you can sort the entries first to eliminate the duplicates efficiently. The same goes for things like raytracers in computer graphics. A practical example from a programming contest: the contestants had 10 seconds to render an image with a lot of light reflection inside ellipsoids (the goal was accuracy of output). The reference image was done using straightforward photon mapping (randomly shot photons from light sources) run for ~10 hours. The noise is proportional to 1/sqrt(n); it converges slowly. The top contestants, myself included, fired photons in organized patterns; the result converged as 1/n. With n being very large even in a single second, the contestants beat the contest organizers’ reference image by far. It would have taken months for the contest organizers’ solution to beat the result the contestants got in 10 seconds. (edit: the contest sort of failed in its result though, because the only way to rank images was to compare them to the contest organizers’ solution)
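The duplicate-elimination example can be written out as a small sketch. The function names are hypothetical, and entries are assumed to be comparable values such as strings.

```python
def dedupe_naive(entries):
    """Compare each entry against every entry kept so far: O(n^2) comparisons."""
    unique = []
    for entry in entries:
        if not any(entry == kept for kept in unique):
            unique.append(entry)
    return unique


def dedupe_sorted(entries):
    """Sort first, then drop adjacent repeats: O(n log n) overall."""
    unique = []
    for entry in sorted(entries):
        if not unique or unique[-1] != entry:
            unique.append(entry)
    return unique


lines = ["pear", "apple", "pear", "fig", "apple", "pear"]
print(dedupe_naive(lines))
print(dedupe_sorted(lines))
```

A profiler run on `dedupe_naive` would point at the comparison loop, and tuning that loop leaves it quadratic; the gain comes from changing the algorithm. One visible difference in this sketch is that the sorted version returns entries in sorted order rather than in order of first appearance.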
The profiler: well, sure, the contest organizers could have run a profiler instead of ‘optimizing prematurely’, and could have found out that refraction is where they spent most of the time (or ray-ellipsoid intersection, or whatever else), and they could have optimized those, for an unimportant speed gain. The truth is, they did not even know that their method was too slow without seeing the superior method (they wouldn’t even have thought so if told, nor could they have been convinced by the reasoning that the contestants had used to determine the method to use).
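The convergence gap in the contest anecdote can be illustrated on a much simpler problem. The sketch below is not photon mapping: it estimates a one-dimensional integral, and the integrand, the sample count, and the function names are assumptions chosen for the example. The exact rates also differ from the rendering case (evenly spaced midpoints on a smooth 1-D integrand converge faster than 1/n), so it only shows the general point that organized sample patterns can beat random ones at the same n.

```python
import random


def f(x: float) -> float:
    return x * x  # integrand; its exact integral over [0, 1] is 1/3


def random_estimate(n: int, seed: int = 0) -> float:
    # Plain Monte Carlo: n independent uniform samples, error shrinks like 1/sqrt(n).
    rng = random.Random(seed)
    return sum(f(rng.random()) for _ in range(n)) / n


def organized_estimate(n: int) -> float:
    # One sample at the centre of each of n equal cells.
    return sum(f((i + 0.5) / n) for i in range(n)) / n


N = 1000
EXACT = 1.0 / 3.0
random_error = abs(random_estimate(N) - EXACT)
organized_error = abs(organized_estimate(N) - EXACT)
print(random_error, organized_error)
```

Both estimators do the same amount of work per sample, so a profiler would show near-identical hot spots for each. The difference in accuracy comes entirely from where the samples are placed.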