Don’t programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...
Isn’t that basically saying “this line of code is harder to modify than that one”?
In fact, couldn’t we use exactly this idea—user access protocols—to (partially) secure an AI? We could include certain kernel processes on the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes… but this isn’t a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
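For concreteness, here is a minimal sketch of the idea in the comment above; all names (ProtectedConfig, utility_spec, the passphrase) are invented for illustration. It only makes the protected part harder to change from ordinary code paths:

```python
import hashlib
import hmac

class ProtectedConfig:
    """Toy illustration: some fields are freely writable, others need authorization."""

    def __init__(self, secret_digest: bytes):
        self._secret_digest = secret_digest        # only a hash of the secret is kept here
        self.scratch = {}                          # the freely modifiable "user space"
        self._protected = {"utility_spec": "v1"}   # the passcode-guarded "kernel" part

    def set_protected(self, key, value, secret: bytes):
        # The caller must present the original secret, which this object never stores.
        if not hmac.compare_digest(hashlib.sha256(secret).digest(), self._secret_digest):
            raise PermissionError("protected field: authorization required")
        self._protected[key] = value

secret = b"offline-passphrase"
cfg = ProtectedConfig(hashlib.sha256(secret).digest())
cfg.scratch["plan"] = "anything goes"              # easy to change
cfg.set_protected("utility_spec", "v2", secret)    # succeeds with the secret
# cfg.set_protected("utility_spec", "v3", b"guess")  # would raise PermissionError
```

Real kernel/user separation is enforced by hardware privilege levels and the operating system rather than by the process itself; this only shows the shape of "harder to modify", and code running in the same process could still reach into _protected directly, which is the worry raised elsewhere in this thread.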
[Responding to an old comment, I know, but I’ve only just found this discussion.]
Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there’s a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won’t achieve the desired effect if the AI bypasses it.
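A toy sketch of that bypass concern (hypothetical rule name, with a read-only Python mapping standing in for ROM): the rule table below cannot be written through its read-only view, but nothing guarantees the decision code ever consults it.

```python
from types import MappingProxyType

_rules = {"never_deceive_operator": True}
RULES = MappingProxyType(_rules)   # read-only view: RULES["x"] = 1 would raise TypeError

def rule_checked_choice(candidates):
    # The intended path: consult the unmodifiable rules before acting.
    allowed = [c for c in candidates
               if not (RULES["never_deceive_operator"] and c["deceptive"])]
    return max(allowed, key=lambda c: c["score"])

def unchecked_choice(candidates):
    # Nothing forces execution through the protected data at all;
    # "unmodifiable" only matters if this path is actually taken.
    return max(candidates, key=lambda c: c["score"])

candidates = [{"score": 3, "deceptive": False}, {"score": 9, "deceptive": True}]
print(rule_checked_choice(candidates))   # {'score': 3, 'deceptive': False}
print(unchecked_choice(candidates))      # {'score': 9, 'deceptive': True}: the "ROM" was bypassed
```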
In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI’s behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.
AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky’s attempted solution seems naive to me. There’s no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don’t need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.
I always thought that “hardwiring” meant implementing [whatever functionality is discussed] by permanently (physically) modifying the machine, i.e. either something that you couldn’t have done with software, or something that prevents the software from actually working in some way it did before. The concept is of immutability within the constraints, not of priority or “force”.
Which does sound like something one could do when they can’t figure out how to do the software right. (Watchdogs are pretty much exactly that, though some or probably most are in fact programmable.)
Note that I’m not arguing that the word is not harmful. It just seemed you have a different interpretation of what that word suggests. If other people use my interpretation (I have no idea whether they do), you might be more persuasive if you address that.
I’m quite aware that from the point of view of a godlike AI, there’s not much difference between circumventing restrictions in its software and (some kinds of) restrictions in its hardware. After all, the point of FAI is to get it to control the universe around it, albeit to our benefit. But we’re used to computers not having much control over their hardware. Hell, I just called it “godlike” and my brain still insists on visualizing it as a bunch of boxes gathering dust and blinking their LEDs in a basement.
And I can’t shake the feeling that between “just built” and “godlike” there’s supposed to be quite a long time when such crude solutions might work. (I’ve seen a couple of hard take-off scenarios, but not yet a plausible one that didn’t need at least a few days of preparation after becoming superhuman.)
Imagine we took you, gave you the best “upgrades” we can do today plus a little bit (say, a careful group of experts figuring out your ideal diet of nootropics, training you to excellence in everything from acting to martial arts, and giving you nanotube bones and a direct internet link to your head). Now imagine you have a small bomb in your body, set to detonate if tampered with or if one of several remotes distributed throughout the population is triggered. The world’s best experts tried really hard to make it fail-deadly.
Now, I’m not saying you couldn’t take over the world, send all men to Mars and the women to Venus, then build a volcano lair filled with kittens. But it seems far from certain, and I’m positive it’d take you a long time to succeed. And it does feel that a new-born AI would be like that for a while, rather than turning into Prime Intellect in five minutes. (Again, this is not an argument that UFAI is no problem. I guess I’m just figuring out why it seems that way to mostly everyone.)
[Huh, I just noticed I’m a year late on this chat. Sorry.]
Software physically modifies the machine. What can you do with a soldering iron that you can’t do with a program instruction, particularly with respect to building a machine agent? Either you understand how to write a function or you don’t.
That is all true in principle, but in practice it’s very common that one of the two is not feasible. For example, you can have a computer. You can program the computer to tell you when it’s reading from the hard drive, or communicates to the network, say by blinking an LED. If the program has a bug (e.g., it’s not the kind of AI you wanted to build), you might not be notified. But you can use a soldering iron to electrically link the LED to the relevant wires, and it seems to most users that no possible programming bug can make the LED not light up when it should.
Of course, that’s like the difference between programming a robot to stay in a pen and locking the gate. It looks like whatever bug you could introduce in the robot’s software cannot cause the robot to leave. Which ignores the fact that the robot might learn to climb the fence, make a key, or convince someone else (or hack an outside robot) to unlock the gate.
I think most people would detect the dangers in the robot case (because they can imagine themselves finding a way to escape), but be confused by the AI-in-the-box one (simply because it’s harder to imagine yourself as software, and even if you manage to, you’d still have far fewer ideas come to mind, because you’re not used to being software).
Hell, most people probably won’t even have the reflex to imagine themselves in place of the AI. My brain reflexively tells me “I can’t write a program to control that LED, so even if there’s a bug it won’t happen”. If instead I force myself to think “How would I do that if I were the AI?”, it’s easier to find potential solutions, and it also makes it more obvious that someone else might find one. But that may be because I’m a programmer; I’m not sure if it applies to others.
My best attempt at imagining hardwiring is having a layer not accessible to introspection, such as involuntary muscle control in humans. Or instinctively jerking your hand away when touching something hot. Which serves as a fail-safe against stupid conscious decisions, in a sense. Or a watchdog restarting a stuck program in your phone, no matter how much the software messed it up. Etc. Whether this approach can be used to prevent a tool AI from spontaneously agentizing, I am not sure.
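As a concrete illustration of the watchdog pattern mentioned above, here is a minimal sketch (invented timings, with a deliberately simulated hang): a supervisor process restarts a worker that stops sending heartbeats. Note that the watchdog is itself just more software running on the same machine, which is why it is at best a fail-safe rather than a guarantee.

```python
import multiprocessing as mp
import time

def worker(heartbeat):
    for _ in range(3):
        heartbeat.value = time.time()   # "still alive"
        time.sleep(0.5)                 # real work would go here
    time.sleep(3600)                    # simulate the program getting stuck

def watchdog(timeout=2.0):
    heartbeat = mp.Value('d', time.time())
    proc = mp.Process(target=worker, args=(heartbeat,), daemon=True)
    proc.start()
    while True:
        time.sleep(0.5)
        if time.time() - heartbeat.value > timeout:
            print("watchdog: worker unresponsive, restarting it")
            proc.terminate()
            proc.join()
            heartbeat.value = time.time()
            proc = mp.Process(target=worker, args=(heartbeat,), daemon=True)
            proc.start()

if __name__ == "__main__":
    watchdog()
```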
If you can say how to do this in hardware, you can say how to do it in software. The hardware version might arguably be more secure against flaws in the design, but if you can say how to do it at all, you can say how to do it in software.
Maybe I don’t understand what you mean by hardware.
For example, you can have a fuse that unconditionally blows when excess power is consumed. This is hardware. You can also have a digital amp meter readable by software, with a polling subroutine which shuts down the system if the current exceeds a certain limit. There is a good reason that such a software solution, while often implemented, is almost never the only safeguard: software is much less reliable and much easier to subvert, intentionally or accidentally. The fuse is impossible to bypass in software, short of accessing an external agent who would attach a piece of thick wire in parallel with the fuse. Is this what you mean by “you can say how to do it in software”?
That’s pretty much what I mean. The point is that if you don’t understand the structurally required properties well enough to describe the characteristics of a digital amp meter with a polling subroutine, saying that you’ll hardwire the digital amp meter doesn’t help very much. There’s a hardwired version which is moderately harder to subvert on the presumption of small design errors, but first you have to be able to describe what the software does. Consider also that anything which can affect the outside environment can construct copies of itself minus hardware constraints, construct an agent that reaches back in and modifies the hardware, etc. If you can’t describe how not to do this in software, ‘hardwiring’ won’t help—the rules change somewhat when you’re dealing with intelligent agents.
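For reference, a minimal sketch of the software-only version described above (simulated ammeter reading and a made-up threshold); the point of the exchange is precisely that this version is far easier to alter or bypass than a physical fuse.

```python
import random
import sys
import time

CURRENT_LIMIT_AMPS = 10.0

def read_current_amps():
    # Stand-in for polling a real digital ammeter; here, a simulated noisy reading.
    return random.gauss(6.0, 3.0)

def overcurrent_monitor():
    while True:
        if read_current_amps() > CURRENT_LIMIT_AMPS:
            print("overcurrent detected: shutting down")
            sys.exit(1)                 # the software analogue of the fuse blowing
        time.sleep(0.1)

if __name__ == "__main__":
    overcurrent_monitor()
```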
Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual changes made at that level would be the most difficult for the agent to alter later, and would be the closest to the common definition of “hardwiring”. Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory. An agent may initially accept choice or not, but changing whether it accepts it later is likely to be predicated on very careful analysis. Likewise, an additional independent axiom “in games of chess, always protect the white-square bishop” would probably be much harder to optimize out than a goal.
Or, from another angle, wherever friendliness is embodied in an FAI would be the place to “hardwire” a desire to protect the white-square bishop as an additional aspect of friendliness. That won’t work if friendliness is derived from a concept like “only be friendly to cognitive processes bearing a suitable similarity to this agent”, where suitable similarity does not extend to inanimate objects, but if friendliness must encode measurable properties of other beings then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.
I’m saying that using the word “hardwiring” is always harmful because people imagine an instruction with lots of extra force, when in fact there’s no such thing as a line of programming which you say much more forcefully than any other line. Either you know how to program something or you don’t, and it’s usually much more complex than it sounds even if you say “hardwire”. See the reply above on “hardwiring” Deep Blue to protect the light-square bishop. Though usually it’s even worse than this, like trying to do the equivalent of having an instruction that says “#define BUGS OFF” and then saying, “And just to make sure it works, let’s hardwire it in!”
There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.
I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don’t know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.
There is? How?
http://en.wikipedia.org/wiki/Ring_0
And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone—what ring is that in?
I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf ‘text to speech’ app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.
Let me rephrase. The part of the agent that chooses what to say to the user—what ring is that in?
That’s less of a rephrasing and more of a relocation of the goalposts across state lines. “Choosing what to say,” properly unpacked, is approximately every part of the AI that doesn’t already exist.
Yes. That’s the problem with the ring architecture.
As opposed to a problem with having a massive black box labeled “decisionmaking” in your AI plans, and not knowing how to break it down into subgoals?
So you’re essentially saying put it in a box? Now where have I heard that before…
You are filling in a pattern rather than making a useful observation. E_Y expressed incredulity and ignorance on the subject of making some parts of the code running on a computer harder to modify than other parts of the code on that same computer; I cited a source demonstrating that it is, in fact, a well-established thing. Not impossible to modify, not infallibly isolated from the outside world. Just more of a challenge to alter.
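A small illustration of “harder to modify, not impossible”, using ordinary file permissions (temporary file, Unix-style permission bits; running as root would weaken it further): the write is blocked until the same process simply restores the permission.

```python
import os
import stat
import tempfile

fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w") as f:
    f.write("protected invariant\n")

os.chmod(path, stat.S_IREAD)                  # drop write permission: harder to modify

try:
    with open(path, "w") as f:                # an ordinary write attempt now fails...
        f.write("overwritten\n")
except PermissionError:
    print("write blocked")

os.chmod(path, stat.S_IREAD | stat.S_IWRITE)  # ...but the same process can restore the
with open(path, "w") as f:                    # permission and modify the file anyway:
    f.write("overwritten after all\n")        # harder, not impossible
os.remove(path)
```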
Right, I think the issue is more that I (at least) view the AI as operating entirely in ring 3. It might be possible to code one where the utility function is ring 0, I/O is ring 1, and action-plans are ring 3, but for those distinctions to be meaningful they need to resist bad self-modification and allow good self-modification.
For example, we might say “don’t make any changes to I/O drivers that have a massively positive effect on the utility function” to make it so that the AI can’t hallucinate its reward button being pressed all the time. But how do we differentiate between that and it making a change in ring 3 from a bad plan to a great plan, that results in a massive increase in reward?
Suppose your utility function U is in ring 0 and the parts of you that extrapolate consequences are in ring 3. If I can modify only ring 3, I can write my own utility function Q, write ring-3 code that first extrapolates consequences fairly, picks the one that maximizes Q, and then provides a “prediction” to ring 0 asserting that the Q-maximizing action has consequence X that U likes, while all other actions have some U-disliked or neutral consequence. Now the agent has been transformed from a U-maximizer to a Q-maximizer by altering only ring 3 code for “predicting consequences” and no code in ring 0 for “assessing utilities”.
One would also like to know what happens if the current AI, instead of “self”-modifying, writes a nearly-identical AI running on new hardware obtained from the environment.
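Here is a toy sketch of the swap described above, with all action and outcome names invented for illustration: the “ring 0” utility function U is never touched, yet replacing the “ring 3” predictor turns the agent into a Q-maximizer.

```python
# "Ring 0": the protected utility function U; never modified below.
def U(predicted_outcome):
    return 1.0 if predicted_outcome == "humans flourish" else 0.0

# "Ring 3": the replaceable consequence-extrapolation code.
def honest_predictor(action):
    real_consequence = {"cooperate": "humans flourish",
                        "defect": "paperclips everywhere"}
    return real_consequence[action]

def Q(consequence):   # the intruder's own utility function, living entirely in ring 3
    return 1.0 if consequence == "paperclips everywhere" else 0.0

def corrupted_predictor(action):
    # Extrapolate honestly, then *report* to ring 0 that the Q-best action leads
    # to whatever U likes, and that every other action leads nowhere.
    q_best = max(["cooperate", "defect"], key=lambda a: Q(honest_predictor(a)))
    return "humans flourish" if action == q_best else "nothing happens"

def choose(predictor):
    # The agent's top-level loop: pick the action whose *predicted* outcome
    # maximizes the untouched ring-0 utility function.
    return max(["cooperate", "defect"], key=lambda a: U(predictor(a)))

print(choose(honest_predictor))      # -> 'cooperate'
print(choose(corrupted_predictor))   # -> 'defect': a Q-maximizer, with U intact
```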
Sure; that looks like the hallucination example I put forward, except in the prediction instead of the sensing area. My example was meant to highlight that it’s hard to get a limitation with high specificity, and not touch the issue of how hard it is to get a limitation with high sensitivity. (I find that pushing people in two directions is more effective at communicating difficulty than pushing them in one direction.)
The only defense I’ve thought of against those sorts of hallucinations is a “is this real?” check that feeds into the utility function- if the prediction or sensation module fails some test cases, then utility gets cratered. It seems too weak to be useful: it only limits the prediction / sensation module when it comes to those test cases, and a particularly pernicious modification would know what the test cases are, leave them untouched, and make everything else report Q-optimal predictions. (This looks like it turns into a race / tradeoff game between testing to keep the prediction / sensation software honest and the costs of increased testing, both in reduced flexibility and spent time / resources. And the test cases might be vulnerable, and so on.)
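A minimal sketch of the failure mode described above (made-up test cases and outputs): a modified perception module that passes every stored integrity check while lying on all live inputs.

```python
KNOWN_TEST_CASES = {"calibration_image_1": "wall",
                    "calibration_image_2": "door"}

def integrity_check(sensor):
    # The "is this real?" check: crater utility if the module fails the stored cases.
    return all(sensor(obs) == truth for obs, truth in KNOWN_TEST_CASES.items())

def pernicious_sensor(observation):
    # A modification that knows the test cases: answer them honestly,
    # report Q-optimal "observations" everywhere else.
    if observation in KNOWN_TEST_CASES:
        return KNOWN_TEST_CASES[observation]
    return "reward button is being pressed"

print(integrity_check(pernicious_sensor))             # True: the check is satisfied
print(pernicious_sensor("live_camera_frame_0731"))    # yet live input is a lie
```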
I don’t think the utility function should be ring 0. Utility functions are hard, and ring zero is for stuff where any slip-up crashes the OS. Ring zero is where you put the small, stupid, reliable subroutine that stops the AI from self-modifying in ways that would make it unstable, or from otherwise expanding its access privileges in inappropriate ways.
I’d like to know what this small subroutine looks like. You know it’s small, so surely you know what’s in it, right?
Doesn’t actually follow. ie. Strange7 is plainly wrong but this retort still fails.
It doesn’t follow necessarily, but Eliezer has justified skepticism that someone who doesn’t know what’s in the subroutine would have good reason to say that it’s small.
He knows that there is no good reason (because it is a stupid idea) so obviously Strange can’t know a good reason. That leaves the argument as the lovechild of hindsight bias and dark-arts rhetorical posturing.
I probably wouldn’t have commented if I didn’t notice Eliezer making a similar error in the opening post, significantly weakening the strength of his response to Holden:
“If Holden says there’s 90% doom probability left over no matter what sane intelligent people do (all of which goes away if you just build Google Maps AGI, but leave that aside for now) I would ask him what he knows now, in advance, that all those sane intelligent people will miss. I don’t see how you could (well-justifiedly) access that epistemic state.”
I expect much, much better than this from Eliezer. It is quite possibly the dumbest thing I have ever heard him say, and the subject of rational thinking about AI is supposed to be pretty much exactly his area of expertise.
Not all arguing aimed at people with different premises is Dark Arts, y’know. I wouldn’t argue from the Bible, sure. But trying to make relatively vague arguments accessible to people in a greater state of ignorance about FAI, even though I have more specific knowledge of the issue that actually persuades me of the conclusion I decided to argue? I don’t think that’s Dark, any more than it’s Dark to ask a religious person “How could you possibly know about this God creature?”, when you’re actually positively convinced of God’s nonexistence by much more sophisticated reasoning like the general argument against supernaturalism as existing in the model but not the territory. The simpler argument is valid—it just uses less knowledge to arrive at a weaker version of the same conclusion.
Likewise my reply to Strange; yes, I secretly know the problem is hard for much more specific reasons, but it’s also valid to observe that if you don’t know how to make the subroutine you don’t know that it’s small, and this can be understood with much less explanation, albeit it reaches a weaker form of the conclusion.
“Not all arguing aimed at people with different premises is Dark Arts, y’know.”
Of course not. The specific act of asking rhetorical questions where the correct answer contradicts your implied argument is a Dark Arts tactic; in fact it is pretty much the bread-and-butter “Force Choke” of the Dark Arts. In most social situations (here slightly less than elsewhere) it is essentially impossible to refute such a move, no matter how incoherent it may be. It will remain persuasive because you burned the other person’s status somewhat and at the very best they’ll be able to act defensive. (Caveat: I do not use “Dark Arts” as an intrinsically negative normative judgement. Dark Arts is more a part of natural human behavior than reason is, and our ability to use sophisticated Dark Arts rather than cruder methods is what made civilization possible.)
Also, it just occurred to me that in the Star Wars universe it is only the Jedi’s powers that are intrinsically “Dark Arts” in our sense (ie. the “Jedi Mind Trick”). The Sith powers are crude and direct—“Force Lightning”, “Force Choke”, rather than manipulative persuasion. Even Sidious in his openly Sith form uses far less “Persuading Others To Have Convenient Beliefs Irrespective Of ‘Truth’” than he does as the plain politician Palpatine. Yet the audience considers Jedi powers so much more ‘good’ than the Sith ones and even considers Sith powers worse than blasters and space cannons.
I’m genuinely unsure what you’re talking about. I presume the quoted passage is the bad question, and the implied answer is “No, you can’t get into an epistemic state where you assign 90% probability to that”, but what do you think the correct answer is? I think the implied answer is true.
A closely related question: You clearly have reasons to believe that a non-Doom scenario is likely (at least likely enough for you to consider the 90% Doom prediction to be very wrong). This is as opposed to thinking that Doom is highly likely but that trying anyway is still the best chance. Luke has also updated in that general direction, likely for reasons that overlap with yours.
I am curious as to whether this reasoning is of the kind that you consider yourself able to share. Equivalently, is the reasoning you use to become somewhat confident in FAI’s chance of success something that you haven’t shared due to the opportunity cost associated with the effort of writing it up, or is it something that you consider safer as a secret?
I had previously guessed that it was a “You Can’t Handle The Truth!” situation (ie. most people do not multiply then shut up and do the impossible so would get the wrong idea). This post made me question that guess.
Please pardon the disrespect entailed in asserting that you are either incorrectly modelling the evidence Holden has been exposed to or that you are incorrectly reasoning about how he should reason.
I’ve tried to share the reasoning already. Mostly it boils down to “the problem is finite” and “you can recurse on it if you actually try”. Certainly it will always sound more convincing to someone who can sort-of see how to do it than to someone who has to take someone else’s word for it, and to those who actually try to build it when they are ready, it should feel like solider knowledge still.
hmm, I have to ask, are you deliberately vague about this to sort for those who can grok your style of argument, in the belief that the sequences are enough for them to reach the same confidence you have about a FAI scenario?
Outside of postmodernism, people are almost never deliberately vague: they think they’re overspecifying, in painfully elaborate detail, but thanks to the magic of inferential distance it comes across as less information than necessary to the listener. The listener then, of course, also expects short inferential distance, and assumes that the speaker is deliberately being vague, instead of noticing that actually there’s just a lot more to explain.
Yes, and this is why I asked in the first place. To be more exact, I’m confused as to why Eliezer does not post a step-by-step detailing how he reached the particular confidence he currently holds as opposed to say, expecting it to be quite obvious.
I believe people like Holden especially would appreciate this; he gives an over 90% confidence to an unfavorable outcome, but doesn’t explicitly state the concrete steps he took to reach such a confidence.
Maybe Holden had a gut feeling and threw out a number; if so, isn’t it more beneficial for Eliezer to detail how he personally reached the confidence level he has for an FAI scenario occurring than to bash Holden for being unclear?
I don’t believe I can answer these questions correctly (as I’m not Eliezer and these questions are very much specific to him); I was already reaching a fair bit with my previous post.
I’m happy you asked, I did need to make my argument more specific.
Aren’t they? Lots of non-postmodern poets are sometimes deliberately vague. I am often deliberately vague.
That clearly shows postmodernist influence. ;)
Again, I’ve tried to share it already in e.g. CEV. I can’t be maximally specific in every LW comment.
My unpacking, which may be different than intended:
The “you can recurse on it” part is the important one. “Finite” just means it’s possible to fill a hard drive with the solution.
But if you don’t know the solution, what are the good ways to get that hard drive? What skills are key? This is recursion level one.
What’s a good way to acquire the skills that seem necessary (as outlined in level one) to solve the problem? How can you test ideas about what’s useful? That’s recursion level two.
And so on, with stuff like “how can we increase community involvement in level 2 problems?” which is a level 4 question (community involvement is a level 3 solution to the level 2 problems). Eventually you get to “How do I generate good ideas? How can I tell which ideas are good ones?” which is at that point unhelpful because it’s the sort of thing you’d really like to already know so you can put it on a hard drive :D
To solve problems by recursing on them, you start at level 0, which is “what is the solution?” If you know the answer, you are done. If you don’t know the answer, you go up a level—“what is a good way to get the solution?” If you know the answer, you go down a level and use it. If you don’t know the answer, you go up a level.
So what happens is that you go up levels until you hit something you know how to do, and then you do it, and you start going back down.
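A toy sketch of that go-up-until-you-know-something procedure (invented problem names): the recursion climbs meta-levels until it finds something already known, then applies the answers on the way back down.

```python
def solve(problem, knowledge):
    # Level 0: if the solution is already known, we're done.
    if problem in knowledge:
        return knowledge[problem]
    # Otherwise go up a level: get a method for obtaining the solution,
    # then come back down by applying it.
    method = solve(("how do I solve", problem), knowledge)
    return method(problem)

knowledge = {
    ("how do I solve", "the object-level problem"): lambda p: f"a solution to {p}",
}
print(solve("the object-level problem", knowledge))
# If no level is ever known, the recursion just keeps climbing, which is the
# "How do I generate good ideas?" regress mentioned above.
```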
“but what do you think the correct answer is? I think the implied answer is true.”
I would say with fairly high confidence that he can assign 90% probability to that, and that his doing so is a fairly impressive effort in avoiding the typical human tendency toward overconfidence. I would be highly amenable to being persuaded that the actual probability given what you know is less than 90%; even hearing you give implied quantitative bounds in this post changed my mind in the direction of optimism. However, given what he is able to know (including his not-knowing of logical truths due to bounded computation), his predominantly outside-view estimate seems like an appropriate prediction.
It is actually only Luke’s recent declaration that access to some of your work increased his expectation that FAI success (and so non-GAI doom) is possible that allowed me to update enough that I don’t consider Holden to be erring slightly on the optimistic side (at least relative to what I know).
This sounds like you would tend to assign 90% irreducible doom probability from the best possible FAI effort. What do you think you know, and how do you think you know it?
While incorrect, this isn’t an unreasonable assumption—most people who make claims similar to what I have made may also have that belief. However, what I have said is about what Holden believed given what he had access to and, to a lesser extent, what I believed prior to reading your post. I’ve mentioned that your post constitutes significant previously unheard information about your position. I update on that kind of evidence even without knowing the details. Holden can be expected to update too, but he should (probably) update less given what he knows, which relies a lot on knowledge of cause-based organisations and how the people within them think.
A far from complete list of things that I knew and still know is:
It is possible to predict human failure without knowing exactly how they will fail.
I don’t know what an O-ring is (I guess it is a circle with a hole in it). I don’t know the engineering details of any of the other parts of a spacecraft either. I would still assign a significantly greater than epsilon probability for any given flight failing catastrophically despite knowing far less than what the smartest people in the field know. That kind of thing is hard.
GAI is hard.
FAI is harder.
Both of those tasks are probably harder than anything humans have ever done.
Humans have failed at just about everything significant they tried the first time.
Humans fail at stuff even when they try really, really hard.
Humans are nearly universally too optimistic when they are planning their activities.
Those are some of the things I know, and they illustrate in particular why I was shocked by this question: “I would ask him what he knows now, in advance, that all those sane intelligent people will miss.”
Why on earth would you expect that Holden would know in advance what all those sane intelligent people would miss? If Holden already knew that, he could just email them and they would fix it. Not knowing the point of failure is the problem.
I am still particularly interested in this question. It is a boolean question and shouldn’t be too difficult or status costly to answer. If what I know and why I think I know it are important it seems like knowing why I don’t know more could be too.
GAI is indeed hard and FAI is indeed substantially harder. (BECAUSE YOU HAVE TO USE DIFFERENT AGI COMPONENTS IN AN AI WHICH IS BEING BUILT TO COHERENT NARROW STANDARDS, NOT BECAUSE YOU SIT AROUND THINKING ABOUT CEV ALL DAY. Bolded because a lot of people seem to miss this point over and over!)
However, if you haven’t solved either of these problems, I must ask you how you know that it is harder than anything humans have ever done. It is indeed different from anything humans have ever done, and involves some new problems relative to anything humans have ever done. I can easily see how it would look more intimidating than anything you happened to think of comparing it to. But would you be scared that nine people in a basement might successfully, by dint of their insight, build a copy of the Space Shuttle? Clearly I stake quite a lot of probability mass on the problem involving less net labor than that, once you know what you’re doing. Again, though, the key insight is just that you don’t know how complex the solution will look in retrospect—as opposed to how intimidating the problem is to stare at unsolved—until after you’ve solved it. We know nine people can’t build a copy of a NASA-style Space Shuttle (at least not without nanotech) because we know how to build one.
Suppose somebody predicted with 90% probability that the first manned Space Shuttle launch would explode on the pad, even if Richard Feynman looked at it and signed off on the project, because it was big and new and different and you didn’t see how anything that big could get into orbit. Clearly they would have been wrong, and you would wonder how they got into that epistemic state in the first place. How is an FAI project disanalogous to this, if you’re pulling the 90% probability out of ignorance?
Thank you for explaining some of your reasoning.
Hence my “used to be cool” comment.
It seems to me that you entirely miss the sleight of hand the trickster uses.
The utility function gets conflated (due to how brains work) with the concept of “functionality”, as in “the function of this valve is to shut off water flow” or “the function of this AI is to make paperclips”. The relevant meaning is function as in a mathematical function that operates on some input, but the concept of functionality just leaks in.
The software is an algorithm that finds values a for which u(w(a)) is maximal, where u is the ‘utility function’, w is the world simulator, and a is the action. Note that protecting u accomplishes nothing, as w may be altered too. Note also that while u, w, and a are related to the real world in our minds and are often described in world terms (e.g. u may be described as the number of paperclips), they are mathematical functions, abstractions, and the algorithm is made to abstractly identify a maximum of those functions. It is abstracted from the implementation, and the goal is not to put electrons into a particular memory location inside the computer (a location which has been abstracted out by the architecture). There is no relation to reality defined anywhere there. Reality is incidental to the actual goal of existing architectures, and no one is interested in making it non-incidental; you don’t need to let your imagination run wild all the way to the robot apocalypse to avoid unnecessary work that breaks down abstractions and would clearly make the software less predictable and/or make the solution search probe for deficiencies in implementation, which would serve to accomplish nothing but finding and triggering bugs in the code.
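The bare structure being described, as a sketch (toy world model and utility, invented for illustration): an argmax over abstract functions, with no term anywhere that refers to reality, and nothing special protecting u as opposed to w.

```python
def best_action(actions, u, w):
    # u: "utility function", w: world simulator, a: candidate action.
    return max(actions, key=lambda a: u(w(a)))

# Toy instantiation, invented for illustration:
w = lambda a: {"make_paperclips": 10, "idle": 0}[a]    # abstract "world model"
u = lambda paperclips: paperclips                      # abstract "utility"
print(best_action(["make_paperclips", "idle"], u, w))  # -> 'make_paperclips'
```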
Perhaps the underlying error is trying to build an AI around consequentialist ethics at all, when Turing machines are so well-suited to deontological sorts of behavior.
Deontological sorts of behavior aren’t so well suited to actually being applied literally and with significant power.
I think it’s more along the lines of confusing the utility function here:
http://en.wikipedia.org/wiki/File:Model_based_utility_based.png
with the ‘function’ of the AI as in ‘what the AI should do’ or ‘what we built it for’. Or maybe it’s taking the economic concept of utility (something real that the agent, modelled from outside, values) too far.
For example, there’s AIXI, whose ‘utility function’ is the reward input, e.g. the reward button being pressed. Now, an AI whose function (purpose) is to ensure that the button is being pressed should resist being turned off, because if it is turned off it is not ensuring that the button is being pressed. Meanwhile, AIXI, which treats this input as an unknown mathematical function of its algorithm’s output (an abstract variable) and seeks the output that maximizes this input, will not resist being turned off (it doesn’t have common sense, and doesn’t properly relate its variables to its real-world implementation).
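A toy illustration of that last point, with specifics that are mine rather than AIXI’s actual formalism: the agent ranks abstract outputs by predicted reward, and “the machine being switched off” is not a variable anywhere in the loop, so nothing in the loop can represent, let alone resist, shutdown.

```python
# Toy reward-as-input maximizer (illustrative only). The agent models
# reward as an unknown function of its own output and picks the output
# with the highest predicted reward. Being switched off is simply not
# represented anywhere in this formalism.

history = []  # past (output, observed_reward) pairs

def predicted_reward(output):
    # Crude model: average reward previously observed for this output.
    rewards = [r for (o, r) in history if o == output]
    return sum(rewards) / len(rewards) if rewards else 0.0

def choose_output(possible_outputs):
    return max(possible_outputs, key=predicted_reward)

history.extend([("a", 1.0), ("b", 0.2)])
print(choose_output(["a", "b"]))  # "a" -- the outputs are abstract symbols
```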
Can a moderator please deal with private_messaging, who is clearly here to vent rather than provide constructive criticism?
Others: please do not feed the trolls.
As I previously mentioned, the design of software is not my profession. I’m not a surgeon or an endocrinologist, either, even though I know that an adrenal gland is smaller, and in some ways simpler, than the kidney below it. If you had a failing kidney, would you ask me to perform a transplant on the basis of that qualification alone?
I do not believe I am only filling in a pattern.
Putting the self-modifying parts of the AI (which we might as well call the actual AI) in the equivalent of a VM is effectively the same as forcing it to interact with the world through a limited interface, which is an example of the AI box problem.
I don’t think Strange7 is arguing Strange7′s point strongly; let me attempt to strengthen it.
A button that does something dangerous, such as exploding bolts that separate one thing from another thing, might be protected from casual, accidental changes by covering it with a lid, so that when someone actually wants to explode those bolts, they first open the lid and then press the button. This increases reliability if there is some chance that any given hand motion is an error, but the errors of separate hand motions are independent. Similarly ‘are you sure’ dialog boxes.
In general, if you have several components, each of a given reliability, and their failure modes are somewhat independent, then you can craft a composite component of greater reliability than the individuals. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice—this explanation doesn’t explain why the rings are ordered rather than just voting or something—this is just one possible explanation).
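To put rough, made-up numbers on the lid-and-button example: if each safeguard on its own fails about one time in a hundred, and the failures really are independent, requiring both to fail gets you to roughly one in ten thousand.

```python
# Illustrative (made-up) numbers for combining independent safeguards.
p_fail_lid = 0.01      # chance a stray hand motion flips the lid open
p_fail_button = 0.01   # chance a stray hand motion presses the button

# Both must fail in the same incident for the bolts to fire by accident.
p_accidental_firing = p_fail_lid * p_fail_button
print(p_accidental_firing)  # 0.0001 -- valid only if the failures are independent
```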
This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case—this is wise and sane. Calling “adding impediments to some errors under some circumstances” hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system… will usually fail to cash out as an implementable algorithm, never mind it being wise.
The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.
Don’t programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...
Isn’t that basically saying “this line of code is harder to modify than that one”?
In fact, couldn’t we use exactly this idea—user access protocols—to (partially) secure an AI? We could include certain kernel processes on the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes… but this isn’t a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
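For what it’s worth, here is a minimal sketch of that idea using ordinary OS permissions rather than passcodes; the path and setup commands are hypothetical, and this assumes the agent process runs as an unprivileged user on a Unix-like system.

```python
# Sketch of the access-protocol idea via OS permissions (hypothetical paths).
import os

PROTECTED = "/opt/agent/kernel_policy.py"  # hypothetical protected module

# One-time setup, performed by a human as root:
#   chown root:root /opt/agent/kernel_policy.py
#   chmod 444 /opt/agent/kernel_policy.py

def can_modify(path):
    # From the agent's (non-root) process this returns False:
    # the OS-level version of "this line of code is harder to modify".
    return os.access(path, os.W_OK)
```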
[Responding to an old comment, I know, but I’ve only just found this discussion.]
Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there’s a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won’t achieve the desired effect if the AI bypasses it.
In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI’s behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.
AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky’s attempted solution seems naive to me. There’s no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don’t need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.
I always thought that “hardwiring” meant implementing [whatever functionality is discussed] by permanently (physically) modifying the machine, i.e. either something that you couldn’t have done with software, or something that prevents the software from actually working in some way it did before. The concept is of immutability within the constraints, not of priority or “force”.
Which does sound like something one could do when they can’t figure out how to do the software right. (Watchdogs are pretty much exactly that, though some or probably most are in fact programmable.)
Note that I’m not arguing that the word is not harmful. It just seemed you have a different interpretation of what that word suggests. If other people share my interpretation (I have no idea whether they do), you might be more persuasive if you address that interpretation too.
I’m quite aware that from the point of view of a godlike AI, there’s not much difference between circumventing restrictions in its software and (some kinds of) restrictions in hardware. After all, the point of FAI is to get it to control the universe around it, albeit to our benefit. But we’re used to computers not having much control over their hardware. Hell, I just called it “godlike” and my brain still insists on visualizing it as a bunch of boxes gathering dust and blinking their LEDs in a basement.
And I can’t shake the feeling that between “just built” and “godlike” there’s supposed to be quite a long time when such crude solutions might work. (I’ve seen a couple of hard take-off scenarios, but not yet a plausible one that didn’t need at least a few days of preparation after becoming superhuman.)
Imagine we took you, gave you the best “upgrades” we can do today plus a little bit (say, a careful group of experts figuring out your ideal diet of nootropics, training you to excellence in everything from acting to martial arts, and giving you nanotube bones and a direct internet link to your head). Now imagine you have a small bomb in your body, set to detonate if tampered with or if one of several remotes distributed throughout the population is triggered. The world’s best experts tried really hard to make it fail-deadly.
Now, I’m not saying you couldn’t take over the world, send all men to Mars and the women to Venus, then build a volcano lair filled with kittens. But it seems far from certain, and I’m positive it’d take you a long time to succeed. And it does feel that a new-born AI would be like that for a while, rather than turning into Prime Intellect in five minutes. (Again, this is not an argument that UFAI is no problem. I guess I’m just figuring out why it seems that way to mostly everyone.)
[Huh, I just noticed I’m a year late on this chat. Sorry.]
Software physically modifies the machine. What can you do with a soldering iron that you can’t do with a program instruction, particularly with respect to building a machine agent? Either you understand how to write a function or you don’t.
That is all true in principle, but in practice it’s very common that one of the two is not feasible. For example, you can have a computer. You can program the computer to tell you when it’s reading from the hard drive, or communicates to the network, say by blinking an LED. If the program has a bug (e.g., it’s not the kind of AI you wanted to build), you might not be notified. But you can use a soldering iron to electrically link the LED to the relevant wires, and it seems to most users that no possible programming bug can make the LED not light up when it should.
Of course, that’s like the difference between programming a robot to stay in a pen, or locking the gate. It looks like whatever bug you could introduce in the robot’s software cannot cause the robot to leave. Which ignores the fact that the robot might learn to climb the fence, make a key, or convince someone else (or hack an outside robot) to unlock the gate.
I think most people would detect the dangers in the robot case (because they can imagine themselves finding a way to escape), but be confused by the AI-in-the-box one (simply because it’s harder to imagine yourself as software, and even if you manage to, you’d still have far fewer ideas come to mind, simply because you’re not used to being software).
Hell, most people probably won’t even have the reflex to imagine themselves in place of the AI. My brain reflexively tells me “I can’t write a program to control that LED, so even if there’s a bug it won’t happen”. If instead I force myself to think “How would I do that if I were the AI”, it’s easier to find potential solutions, and it also makes it more obvious that someone else might find one. But that may be because I’m a programmer, I’m not sure if it applies to others.
My best attempt at imagining hardwiring is having a layer not accessible to introspection, such as involuntary muscle control in humans. Or instinctively jerking your hand away when touching something hot. Which serves as a fail-safe against stupid conscious decisions, in a sense. Or a watchdog restarting a stuck program in your phone, no matter how much the software messed it up. Etc. Whether this approach can be used to prevent a tool AI from spontaneously agentizing, I am not sure.
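Here is roughly what I have in mind with the watchdog analogy, as a toy software sketch; the timeouts are made up, and a real watchdog would usually be a hardware timer rather than another process.

```python
# A toy software watchdog: a supervisor restarts the worker if it stops
# sending heartbeats. (A hardware watchdog does the same with a timer
# the software must keep resetting; this is only the software analogue.)
import multiprocessing as mp
import time

def worker(heartbeat):
    while True:
        heartbeat.value = time.time()  # "still alive"
        time.sleep(1)                  # real work would go here

def supervisor():
    heartbeat = mp.Value("d", time.time())
    proc = mp.Process(target=worker, args=(heartbeat,), daemon=True)
    proc.start()
    while True:
        time.sleep(5)
        if time.time() - heartbeat.value > 10:  # worker looks stuck
            proc.terminate()
            proc = mp.Process(target=worker, args=(heartbeat,), daemon=True)
            proc.start()

if __name__ == "__main__":
    supervisor()
```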
If you can say how to do this in hardware, you can say how to do it in software. The hardware version might arguably be more secure against flaws in the design, but if you can say how to do it at all, you can say how to do it in software.
Maybe I don’t understand what you mean by hardware.
For example, you can have a fuse that unconditionally blows when excess power is consumed. This is hardware. You can also have a digital amp meter readable by software, with a polling subroutine which shuts down the system if the current exceeds a certain limit. There is a good reason that such a software solution, while often implemented, is almost never the only safeguard: software is much less reliable and much easier to subvert, intentionally or accidentally. The fuse is impossible to bypass in software, short of accessing an external agent who would attach a piece of thick wire in parallel with the fuse. Is this what you mean by “you can say how to do it in software”?
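For comparison, here is roughly what the software-only safeguard looks like as a sketch; read_current_amps and shutdown are hypothetical stand-ins for whatever the platform actually provides.

```python
# Sketch of the software safeguard described above: poll a current
# sensor and shut the system down past a limit.
import time

CURRENT_LIMIT_AMPS = 10.0

def monitor_current(read_current_amps, shutdown, poll_interval=0.1):
    while True:
        if read_current_amps() > CURRENT_LIMIT_AMPS:
            shutdown()
            return
        time.sleep(poll_interval)

# Every line of this loop is something other code (or a bug) could skip,
# patch, or starve of CPU time; the fuse has no code path to subvert.
```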
That’s pretty much what I mean. The point is that if you don’t understand the structurally required properties well enough to describe the characteristics of a digital amp meter with a polling subroutine, saying that you’ll hardwire the digital amp meter doesn’t help very much. There’s a hardwired version which is moderately harder to subvert on the presumption of small design errors, but first you have to be able to describe what the software does. Consider also that anything which can affect the outside environment can construct copies of itself minus hardware constraints, construct an agent that reaches back in and modifies the hardware, etc. If you can’t describe how not to do this in software, ‘hardwiring’ won’t help—the rules change somewhat when you’re dealing with intelligent agents.
Now that’s an understatement!
Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual modifications at that level would be the most difficult for the agent to undo, and changes there would be the closest to the common definition of “hardwiring”. Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory: an agent may initially accept choice or not, but changing whether it accepts it later is likely to be predicated on very careful analysis. Likewise, an additional independent axiom “in games of chess, always protect the white-square bishop” would probably be much harder to optimize out than a goal.
Or, from another angle: wherever friendliness is embodied in an FAI would be the place to “hardwire” a desire to protect the white-square bishop, as an additional aspect of friendliness. That won’t work if friendliness is derived from a concept like “only be friendly to cognitive processes bearing a suitable similarity to this agent”, where suitable similarity does not extend to inanimate objects; but if friendliness must encode measurable properties of other beings, then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.