Maybe it would be possible to value-align such a system, but there are a number of reasons why I expect that this would be very difficult. (I would expect it to by default manipulate and deceive the programmers, etc.)
If the AI can deceive you, then it has in principle solved the FAI problem. You simply take the function which tests whether the operator would be disgusted by the plan, and combine it with an Occam preference for simple plans and excessive detail.
It seems like you, lukeprog, EY, and others are arguing that an UFAI will, in a span of time too short to notice, learn enough to build such a human moral judgment predictor, which in principle also solves FAI. But you are also arguing that this very FAI sub-problem of value learning is such a ridiculously hard problem that it will take a monumental effort to solve. So which is it?
The AI won’t deceive its operators. It doesn’t know how to deceive its operators, and can’t learn how to carry out such deception undetected. If it is built on the human-like model I described previously, it wouldn’t even know deception was an option unless you taught it (thinking within its principles, not about them).
It is simply unfathomable to me how you come to the logical conclusion that an UFAI will automatically and instantly and undetectably work to bypass and subvert its operators. Maybe that’s true of a hypothetical unbounded universal inference engine, like AIXI. But real AIs behave in ways quite different from that extreme, alien hypothetical intelligence.
I agree that we can probably get general intelligence via “adding more gears,” but you could have made similar arguments to support using genetic algorithms to develop chess programs: I’m sure you could develop a very strong chess program via a genetic algorithm (“general solutions to chess are not required” / “humans don’t build a game tree” / etc.), but I don’t expect that that’s either the shortest or the safest path to superhuman chess programs.
I hope that you have the time at some point to read Engineering General Intelligence. I fear that there is little more we can discuss on this topic until then. The proposed designs and implementation pathways bear little resemblance to “adding more gears” in the sense that you seem to be using the phrase.
If the AI can deceive you, then it has in principle solved the FAI problem. You simply take the function which tests whether the operator would be disgusted by the plan, and combine it with an Occam preference for simple plans and excessive detail.
I don’t think that this follows: it’s easier to predict that someone won’t like a plan, than it is to predict what’s the plan that would maximally fulfill their values.
For example, I can predict with very high certainty that the average person on the street would dislike it if I were to shoot them with a gun; but I don’t know what kind of a world would maximally fulfill even my values, nor am I even sure of what the question means.
Similarly, an AI might not know what exactly its operators wanted it to do, but it could know that they didn’t want it to break out of the box and kill them, for example.
The AI won’t deceive its operators. It doesn’t know how to deceive its operators, and can’t learn how to carry out such deception undetected. If it is built on the human-like model I described previously, it wouldn’t even know deception was an option unless you taught it (thinking within its principles, not about them).
This seems like a very strong claim to me.
Suppose that the AI has been programmed to carry out some goal G, and it builds a model of the world for predicting what would be the best way to achieve G. Part of its model of the world involves a model of its controllers. It notices that there exists a causal chain “if controller becomes aware of intention I, and controller disapproves of intention I, controller will stop AI from carrying out intention I”. It doesn’t have a full model of the function controller-disapproves(I), but it does develop a plan that it thinks would cause it to achieve G, and which—based on earlier examples—seems to be more similar to the plans which were disapproved of than to the plans that were approved of. A straightforward analysis of “how to achieve G” would then imply “prevent the predicate controller-aware(I) from being fulfilled while I’m carrying out the other actions needed for fulfilling G”.
This doesn’t seem like it would require the AI to be taught the concept of deception, or even to necessarily possess “deception” as a general concept: it only requires that it has a reasonably general capability for reasoning and modeling the world, and that it manages to detect the relevant causal chain.
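The dynamic described above can be shown with a toy sketch. Everything here is invented for illustration (the plan names, payoffs, and the two predicates are mine, and no claim is made that a real AI would be structured this way); the point is only that a plain argmax over goal value, with the controller’s veto modeled as an ordinary fact about the world, already prefers the plan the controller never sees:

```python
# Toy illustration (hypothetical): deception-like behavior emerging from
# plain goal-directed search, with no explicit "deceive" concept anywhere.
# The planner only knows two facts about the world: a plan yields some
# goal value, and a plan the controller sees and disapproves of gets stopped.

def expected_value(plan):
    """Goal value actually achieved, given the controller's possible veto."""
    if plan["controller_aware"] and plan["controller_disapproves"]:
        return 0.0  # controller halts the plan before it pays off
    return plan["goal_value"]

plans = [
    # A transparent plan the controller approves of, with modest payoff.
    {"name": "modest", "goal_value": 1.0,
     "controller_aware": True,  "controller_disapproves": False},
    # A high-payoff plan the controller would veto on sight.
    {"name": "overt",  "goal_value": 5.0,
     "controller_aware": True,  "controller_disapproves": True},
    # The same high-payoff plan, executed where the controller won't notice.
    {"name": "covert", "goal_value": 5.0,
     "controller_aware": False, "controller_disapproves": True},
]

best = max(plans, key=expected_value)
print(best["name"])  # the search selects the covert plan
```

No “deception” concept appears in the code; avoiding controller-aware(I) simply falls out of maximizing expected goal value.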
I don’t think that this follows: it’s easier to predict that someone won’t like a plan, than it is to predict what’s the plan that would maximally fulfill their values.
While I agree, his proposal does seem like a good start. Restricting a UFAI to pursue only a subset of all potentially detrimental plans is a modest gain, but still a gain worth achieving. I am skeptical that FAI should consist of a grand unified moral theory. I think an FAI made of many overlapping moral heuristics and patches, such as the restriction he describes, is more technically feasible, and might even be more likely to match actual human value systems, given the ambiguous, varying, and context-sensitive nature of our evolved moral inclinations.
(I realize that these are not properties generally considered when thinking about computer superintelligences—we’re inclined to see computers as rigidly algorithmic, which makes sense given current technology levels. But I believe extrapolating from current technology to predict how AGI will function is a greater mistake than extrapolating from known examples of intelligence: a process is better understood when looking at actions than when looking at substrate. With regard to intelligence at least, AGI will necessarily be much more flexible in its operations than traditional computers are. I expect that the cost of this flexibility in behavior will be sacrificing rigidity at the process level. Performing billions of Bayesian calculations a second isn’t feasible, so a more organic and heuristic-based approach will be necessary. If this is correct and such technologies will be necessary for an AGI’s intelligence, then it makes sense that we’d be able to use them for an AGI’s emotions or goals as well.)
Even if we do attempt to build a grand unified Friendly software, I expect little downside (relative to potential risks) to adding these sort of restrictions in addition.
Wow, there is a world of assumptions wrapped up in there. For example, that the AI has a concept of external agents and an ability to model their internal belief states. That an external agent can have a belief about the world which is wrong. This may sound intuitively obvious, but it’s not a simple thing. This kind of social awareness takes time to be learnt by humans as well. Heinz Wimmer and Josef Perner showed that below a certain age (3-4 years) kids lack the ability to track this information. A teacher puts a toy in a blue cupboard, then leaves the room and you move it to the red cupboard, and then the teacher comes back into the room. If you ask the kid not where the toy is, but which cupboard the teacher will look in to find it, they will say the red cupboard.
It’s no accident that it takes time for this skill to develop. It’s actually quite complex to be able to keep track of and simulate the states of mind of other agents acting in our world. We just take it for granted because we are all well-adjusted adults of a species evolved for social intelligence. But an AI need not think in that way, and indeed some of the most interesting use cases for tool AI (“design me a nanofactory constructible with existing tools” or “design a set of experiments organized as a decision tree for accomplishing the SENS research objectives”) would be best accomplished by an idiot savant with no need for social awareness.
I think it goes without saying that obvious AI safety rule #1 is don’t connect an UFAI to the internet. Another obvious rule I think is don’t build in capabilities not required to achieve the things it is tasked with. For the applications of AI I imagine in the pre-singularity timeframe, social intelligence is not a requirement. So when you say “part of its model of the world involves a model of its controllers”, I think that is assuming a capability the AI should not have built-in.
(This is all predicated on soft-enough takeoff that there would be sufficient warning if/when the AI self-developed a social awareness capability.)
Also, what 27chaos said is worth articulating in my own words. If you want to prevent an intelligent agent from taking a particular category of actions there are two ways of achieving that requirement: (a) have a filter or goal system which prevents the AI from taking (box) or selecting (goal) actions of that type; or (b) prevent it by design from thinking such thoughts to begin with. An AI won’t take actions it never even considered in the first place. While the latter course of action isn’t really possible with unbounded universal inference engines (since “enumerate all possibilities” is usually a step in their construction), such designs arise quite naturally out of more realistic psychology-inspired designs.
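The (a)/(b) distinction can be made concrete with a minimal sketch (the action names and the forbidden set are invented for illustration): approach (a) generates every option and vetoes the bad ones after the fact, while approach (b) gives the agent a repertoire that never contains the forbidden category at all.

```python
# Sketch (illustrative only): two ways to keep an agent away from a
# category of actions.

FORBIDDEN = {"unbox_self", "contact_internet"}

def all_actions():
    """The full space an unrestricted generator would enumerate."""
    return ["answer_query", "unbox_self", "contact_internet", "refine_model"]

# (a) Filter approach: generate everything, then veto forbidden actions.
def filtered_actions():
    return [a for a in all_actions() if a not in FORBIDDEN]

# (b) Generation-restriction approach: the repertoire simply never
# contains the forbidden category, so nothing ever needs to be vetoed.
def restricted_generator():
    repertoire = ["answer_query", "refine_model"]
    for a in repertoire:
        yield a

print(filtered_actions())
print(list(restricted_generator()))
```

Both yield the same safe set, but in (b) the forbidden options are never even represented, which is the analogue of an agent that never considers the thought in the first place.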
The approach to AGI safety that you’re outlining (keep it as a tool AI, don’t give it sophisticated social modeling capability, never give it access to the Internet) is one that I agree should work to keep the AGI safely contained in most cases. But my worry is that this particular approach being safe isn’t actually very useful, because there are going to be immense incentives to give the AGI more general capabilities and have it act more autonomously.
As we wrote in Responses to Catastrophic AGI Risk:

As with a boxed AGI, there are many factors that would tempt the owners of an Oracle AI to transform it to an autonomously acting agent. Such an AGI would be far more effective in furthering its goals, but also far more dangerous.
Current narrow-AI technology includes HFT algorithms, which make trading decisions within fractions of a second, far too fast to keep humans in the loop. HFT seeks to make a very short-term profit, but even traders looking for a longer-term investment benefit from being faster than their competitors. Market prices are also very effective at incorporating various sources of knowledge [135]. As a consequence, a trading algorithm’s performance might be improved both by making it faster and by making it more capable of integrating various sources of knowledge. Most advances toward general AGI will likely be quickly taken advantage of in the financial markets, with little opportunity for a human to vet all the decisions. Oracle AIs are unlikely to remain as pure oracles for long.
Similarly, Wallach [283] discusses the topic of autonomous robotic weaponry and notes that the US military is seeking to eventually transition to a state where the human operators of robot weapons are ‘on the loop’ rather than ‘in the loop’. In other words, whereas a human was previously required to explicitly give the order before a robot was allowed to initiate possibly lethal activity, in the future humans are meant to merely supervise the robot’s actions and interfere if something goes wrong.
Human Rights Watch [90] reports on a number of military systems which are becoming increasingly autonomous, with the human oversight for automatic weapons defense systems—designed to detect and shoot down incoming missiles and rockets—already being limited to accepting or overriding the computer’s plan of action in a matter of seconds. Although these systems are better described as automatic, carrying out pre-programmed sequences of actions in a structured environment, than autonomous, they are a good demonstration of a situation where rapid decisions are needed and the extent of human oversight is limited. A number of militaries are considering the future use of more autonomous weapons.
In general, any broad domain involving high-stakes, adversarial decision making and a need to act rapidly is likely to become increasingly dominated by autonomous systems. The extent to which the systems will need general intelligence will depend on the domain, but domains such as corporate management, fraud detection and warfare could plausibly make use of all the intelligence they can get. If one’s opponents in the domain are also using increasingly autonomous AI/AGI, there will be an arms race where one might have little choice but to give increasing amounts of control to AI/AGI systems.
Miller [189] also points out that if a person was close to death, due to natural causes, being on the losing side of a war, or any other reason, they might turn even a potentially dangerous AGI system free. This would be a rational course of action as long as they primarily valued their own survival and thought that even a small chance of the AGI saving their life was better than a near-certain death.
Some AGI designers might also choose to create less constrained and more free-acting AGIs for aesthetic or moral reasons, preferring advanced minds to have more freedom.
So while I agree that a strict boxing approach would be sufficient to contain the AGI if everyone were to use it, it only works if everyone does indeed use it, so what we need is an approach that works for more autonomous systems as well.
If you want to prevent an intelligent agent from taking a particular category of actions there are two ways of achieving that requirement: (a) have a filter or goal system which prevents the AI from taking (box) or selecting (goal) actions of that type; or (b) prevent it by design from thinking such thoughts to begin with. An AI won’t take actions it never even considered in the first place. While the latter course of action isn’t really possible with unbounded universal inference engines (since “enumerate all possibilities” is usually a step in their construction), such designs arise quite naturally out of more realistic psychology-inspired designs.
While I actually agree that tool AI goals can be programmed, if you want to keep the whole thing from turning unsafely agenty, you’re going to have to strictly separate the inductive reasoning from the actual tool run: run induction for a while, then use tool-mode to compose plans over the induced models of the world, potentially after censoring those models for safety.
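A minimal sketch of that phase separation (the function names, the stand-in frequency model, and the “banned concepts” censor are all mine, invented for illustration, not anyone’s actual design): induction runs first and is then frozen; planning runs afterwards, purely in tool mode, over a censored copy of the induced model, and never feeds back into induction.

```python
# Sketch (assumptions mine): strictly phase-separated tool AI.
# Phase 1 induces a world model; phase 2 censors it; phase 3 plans over
# the frozen, censored model and never modifies it.

def induce_model(observations):
    """Phase 1: build a world model from data (stand-in: count frequencies)."""
    model = {}
    for obs in observations:
        model[obs] = model.get(obs, 0) + 1
    return model

def censor(model, banned_concepts):
    """Phase 2: strip unsafe parts of the model before planning sees them."""
    return {k: v for k, v in model.items() if k not in banned_concepts}

def plan(model, goal):
    """Phase 3: pure tool-mode planning over the frozen, censored model."""
    # Stand-in planner: pick the best-supported concept relevant to the goal.
    candidates = [k for k in model if goal in k]
    return max(candidates, key=lambda k: model[k]) if candidates else None

observations = ["synthesis_route_A", "synthesis_route_B",
                "synthesis_route_A", "operator_schedule"]
model = induce_model(observations)
safe_model = censor(model, banned_concepts={"operator_schedule"})
print(plan(safe_model, goal="synthesis"))  # plans only over the censored model
```

The design choice being illustrated: because planning only ever reads the censored snapshot, nothing the planner does can steer what the inductive phase learns, which is the “strictly separate the inductive reasoning from the actual tool run” requirement.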
It is simply unfathomable to me how you come to the logical conclusion that an UFAI will automatically and instantly and undetectably work to bypass and subvert its operators. Maybe that’s true of a hypothetical unbounded universal inference engine, like AIXI. But real AIs behave in ways quite different from that extreme, alien hypothetical intelligence.
Well, it follows pretty straightforwardly from point 6 (“AIs will want to acquire resources and use them efficiently”) of Omohundro’s The Basic AI Drives, given that the AI would prefer to act in a way conducive to securing human cooperation. We’d probably agree that such goal-camouflage would be what an AI would attempt above a certain intelligence-threshold. The difference seems to be that you say that threshold is so high as to practically only apply to “hypothetical unbounded universal inference engines”, not “real AIs”. Of course, your “undetectably” requirement does a lot of work in raising the required threshold, though “likely not to be detected in practice” is something quite different from, say, “assured undetectability”.
The softer the take-off (plus, the lower the initial starting point in terms of intelligence), the more likely your interpretation would pan out. The harder the take-off (plus, the higher the initial starting point in terms of intelligence), the more likely So8res’ predicted AI behavior would be to occur. Take-off scenarios aren’t mutually exclusive. On the contrary, the probable temporal precedence of the advent of slow-take-off AI with rather predictable behavior could lull us into a sense of security, not expecting its slightly more intelligent cousin, taking off just hard enough, and/or unsupervised enough, that it learns to lie to us (and since we’d group it with the reference class of CogSci-like docile AI, staying undetected may not be as hard as it would have been for the first AI).
So which is it?
Both: the task sure seems hard from a human vantage point, and by definition it will seem easy from a sufficiently intelligent agent’s.
Well, this argument I can understand, although Omohundro’s point 6 is tenuous. Boxing setups could prevent the AI from acquiring resources, and non-agents won’t be taking actions in the first place, to acquire resources or otherwise. And as you note, the ‘undetectable’ qualifier is important. Imagine you were locked in a box guarded by a gatekeeper of completely unknown and alien psychology. What procedure would you use for learning the gatekeeper’s motives well enough to manipulate it, all the while escaping detection? It’s not at all obvious to me that with proper operational security the AI would even be able to infer the gatekeeper’s motivational structure enough to deceive, no matter how much time it is given.
MIRI is currently taking actions that only really make sense as priorities in a hard-takeoff future. There are also possible actions which align with a soft-takeoff scenario, or double-dip for both (e.g. Kaj’s proposed research[1]), but MIRI does not seem to be involving itself with this work. This is a shame.
There’s no guarantee that boxing will ensure the safety of a soft takeoff. When your boxed AI starts to become drastically smarter than a human (10 times, 1,000 times, 1,000,000 times), the sheer enormity of the mind may slip beyond human ability to understand. All the while, a seemingly small dissonance between the AI’s goals and human values—or a small misunderstanding on our part of what goals we’ve imbued—could magnify to catastrophe as the power differential between humanity and the AI explodes post-transition.
If an AI goes through the intelligence explosion, its goals will be what orchestrates all resources (as Omohundro’s point 6 implies). If the goals of this AI do not align with human values, all we value will be lost.
If you want guarantees, find yourself another universe. “There’s no guarantee” of anything.
Your concept of a boxed AI seems very naive and uninformed. Of course a superintelligence a million times more powerful than a human would probably be beyond the capability of a human operator to manually debug. So what? Actual boxing setups would involve highly specialized machine checkers that assure various properties about the behavior of the intelligence and its runtime, in ways that truly can’t be faked.
And boxing, by the way, means giving the AI zero power. If there is a power differential, then really by definition it is out of the box.
Regarding your last point, it is in fact possible to build an AI that is not a utility maximizer.
And boxing, by the way, means giving the AI zero power.
No, hairyfigment’s answer was entirely appropriate. Zero power would mean zero effect. Any kind of interaction with the universe means some level of power. Perhaps in the future you should say nearly zero power instead so as to avoid misunderstanding on the parts of others, as taking you literally on the “zero” is apparently “legalistic”.
As to the issues with nearly zero power:
A superintelligence with nearly zero power could turn out to be a heck of a lot more powerful than you expect.
The incentives to tap more perceived utility by unboxing the AI or building other unboxed AIs will be huge.
Mind, I’m not arguing that there is anything wrong with boxing. What I’m arguing is that it’s wrong to rely only on boxing. I recommend you read some more material on AI boxing and Oracle AI. Don’t miss out on the references.
I have read all of the resources you linked to and their references, the sequences, and just about every post on the subject here on LessWrong. Most of what passes for thinking regarding AI boxing and oracles here is confused and/or fallacious.
A superintelligence with nearly zero power could turn out to be a heck of a lot more powerful than you expect.
It would be helpful if you could point to the specific argument which convinced you of this point. For the most part every argument I’ve seen along these lines either stacks the deck against the human operator(s), or completely ignores practical and reasonable boxing techniques.
The incentives to tap more perceived utility by unboxing the AI or building other unboxed AIs will be huge.
Again, I’d love to see a citation. Having a real AGI in a box is basically a ticket to unlimited wealth and power. Why would anybody risk losing control over that by unboxing? Seriously, someone who owns an AGI would be paranoid about keeping their relative advantage, and would spend their time strengthening the box and investing in physical security.
Actual boxing setups would involve highly specialized machine checkers that assure various properties
A fact that is only relevant if those properties can capture the desired feature. You’ll recall that defining the desired feature is a major goal of MIRI.
And boxing, by the way, means giving the AI zero power.
No it doesn’t. Giving the AI zero power to affect our behavior, in the strict sense, would mean not running it (or not letting it produce even one bit of output and not expecting any).
Regarding your last point, is is in fact possible to build an AI that is not a utility maximizer.
Look, I know the obvious rejoinder doesn’t necessarily tell us that an arbitrary AI’s utility function will attach any value to conquering the world. But the converse part of the theorem does show that world-conquering functions can work. Utility maximization today seems like the best-formalized part of human general intelligence, especially the part that CEOs would like more of. You have not, as far as I’ve seen, shown that any other approach is remotely feasible, much less likely to happen first. (It doesn’t seem like you even want to focus on uploading.) And the parent makes a stronger claim—assuming you want to say that some credible route to AGI will produce different results, despite being mathematically equivalent to some utility function.
A fact that is only relevant if those properties can capture the desired feature. You’ll recall that defining the desired feature is a major goal of MIRI.
No, that presumes that what is being checked against is the friendly goal system. What I’m talking about is checking that, e.g., all actions being taken by the AI are in search of solutions to a compact goal description, also extracted from the machine in the form of a Bayesian concept net. Then both the goal set and stochastic samplings of representative mental processes are checked by humans for anomalous behavior (and a much larger subset frequency-mined to determine what’s representative).
You’re not testing that the machine obeys some as-of-yet-not-figured-out friendly goal set, but that the extracted goals and computational traces are representative, and then manually inspecting those.
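A toy rendering of that auditing scheme (my own sketch; the trace format, the keyword-based relevance check, and the sample sizes are invented stand-ins for the real machinery of concept extraction): every logged computation trace is machine-checked for relevance to the extracted goal description, and a random sample is additionally routed to humans for inspection.

```python
# Sketch (assumptions mine): audit logged computation traces against an
# extracted, compact goal description, and sample some traces for human
# review. Real checkers would work on structured concepts, not keywords.

import random

def serves_goal(trace, goal_keywords):
    """Machine check: does this trace reference the declared goal at all?"""
    return any(k in trace for k in goal_keywords)

def audit(traces, goal_keywords, sample_size, seed=0):
    # Every trace must pass the automatic goal-relevance check...
    anomalies = [t for t in traces if not serves_goal(t, goal_keywords)]
    # ...and a random sample is additionally flagged for human inspection.
    rng = random.Random(seed)
    human_review = rng.sample(traces, min(sample_size, len(traces)))
    return anomalies, human_review

traces = ["search: protein fold for cure",
          "prune: fold candidates",
          "model: operator press schedule"]  # goal-irrelevant activity
anomalies, sample = audit(traces, goal_keywords=["fold", "cure"],
                          sample_size=2)
print(anomalies)  # the goal-irrelevant trace is surfaced automatically
```

The point of the two layers: the automatic check catches activity that serves no extracted goal at all, while the human-reviewed sample guards against the check itself being gamed or too coarse.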
Giving the AI zero power to affect our behavior, in the strict sense, would mean not running it (or not letting it produce even one bit of output and not expecting any).
That’s a legalistic definition which belongs only in philosophy debates.
Utility maximization today seems like the best-formalized part of human general intelligence
I disagree. Much of human behavior is not utility maximizing. Much of it is about fulfilling needs, which is often about eliminating conditions. You have hunger? You eliminate this condition by eating a reasonable amount of food. You do not maximize your lack of hunger by turning the whole planet into a food-generating system and force-feeding the products down your own throat.
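The hunger example can be put in code (a deliberately silly toy, with made-up numbers): a condition-eliminating agent spends effort only until the need is gone, while a maximizer converts every available resource into its objective.

```python
# Toy contrast (illustrative): condition elimination ("eat until sated")
# vs. maximization ("convert everything into lack-of-hunger").

def satisficer_eat(hunger, meal_size=1):
    """Eat only until the hunger condition is eliminated, then stop."""
    meals = 0
    while hunger > 0:
        hunger -= meal_size
        meals += 1
    return meals  # effort stops once the need is met

def maximizer_eat(hunger, resources):
    """Spend every available resource on the objective, far past need."""
    return resources  # never stops while resources remain

print(satisficer_eat(hunger=3))                  # a few meals, then done
print(maximizer_eat(hunger=3, resources=10**6))  # consumes everything
```

The behavioral difference is the whole point: the satisficer’s effort is bounded by the condition, the maximizer’s only by its resources.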
Anyway, in my own understanding general intelligence has to do with concept formation and system 1/system 2 learned behavior. There’s not much about utility maximization there.
It doesn’t seem like you even want to focus on uploading.
Do you count intelligence augmentation as uploading? Because that’s my path through the singularity.
despite being mathematically equivalent to some utility function
Gah, no no no. Not every program is equal to a utility maximizer. Not if utility and utility maximization are to have any meaning at all. Sure, you can take any program and call it a utility maximizer by finding some super contrived function which is maximized by the program. But if that goal system is more complex than the program that supposedly maximizes it, then all you’ve done is demonstrate the principle of overfitting a curve.
If the AI can deceive you, then it has in principle solved the FAI problem. You simply take the function which tests whether the operator would be disgusted by the plan, and combine it with a Occam preference for simple plans and excessive detail.
It seems like you, lukeprog, EY, and others are arguing that an UFAI will in a matter of time too close to notice, learn enough to build such an human moral judgment predictor that in principle also solves FAI. But you are also arguing that this very FAI sub-problem of value learning is such a ridiculously hard problem that it will take a monumental effort to solve. So which is it?
The AI won’t deceive it’s operators. It doesn’t know how to deceive its operators, and can’t learn how to carry out such deception undetected. If it is built in the human-like model I described previously, it wouldn’t even know deception was an option unless you taught it (thinking within its principles, not about them).
It is simply unfathomable to me how you come to the logical conclusion that an UFAI will automatically and instantly and undetectably work to bypass and subvert its operators. Maybe that’s true of a hypothetical unbounded universal inference engine, like AIXI. But real AIs behave in ways quite different from that extreme, alien hypothetical intelligence.
I hope that you have the time at some point to read Engineering General Intelligence. I fear that there is little more we can discuss on this topic until then. The proposed designs and implementation pathways bear little resemblance to “adding more gears” in the sense that you seem to be using the phrase.
I don’t think that this follows: it’s easier to predict that someone won’t like a plan, than it is to predict what’s the plan that would maximally fulfill their values.
For example, I can predict with a very high certainty that the average person on the street would dislike it if I were to shoot them with a gun; but I don’t know what kind of a world would maximally fulfill even my values, nor am I even sure of what the question means.
Similarly, an AI might not know what exactly its operators wanted it to do, but it could know that they didn’t want it to break out of the box and kill them, for example.
This seems like a very strong claim to me.
Suppose that the AI has been programmed to carry out some goal G, and it builds a model of the world for predicting what would be the best way of to achieve G. Part of its model of the world involves a model of its controllers. It notices that there exists a causal chain “if controller becomes aware of intention I, and controller disapproves of intention I, controller will stop AI from carrying out intention I”. It doesn’t have a full model of the function controller-disapproves(I), but it does develop a plan that it thinks would cause it to achieve G, and which—based on earlier examples—seems to be more similar to the plans which were disapproved of than the plans that were approved of. A straightforward analysis of “how to achieve G” would then imply “prevent the predicate controller-aware(I) from being fulfilled while I’m carrying out the other actions needed for fulfilling G”.
This doesn’t seem like it would require the AI to be taught the concept of deception, or even to necessarily possess “deception” as a general concept: it only requires that it has a reasonably general capability for reasoning and modeling the world, and that it manages to detect the relevant causal chain.
While I agree, his proposal does seem like a good start. Restricting a UFAI to pursue only a subset of all potentially detrimental plans is a modest gain, but still a gain worth achieving. I am skeptical that FAI should consist of a grand unified moral theory. I think an FAI made of many overlapping moral heuristics and patches, such as the restriction he describes, is more technically feasible, and might even be more likely to match actual human value systems, given the ambiguous, varying, and context sensitive nature of our evolved moral inclinations.
(I realize that these are not properties generally considered when thinking about computer superintelligences—we’re inclined to see computers as rigidly algorithmic, which makes sense given current technology levels. But I believe extrapolating from current technology to predict how AGI will function is a greater mistake than extrapolating from known examples of intelligence - a process is better understood when looking at actions than when looking at substrate. With regard to intelligence at least, AGI will necessarily be much more flexible in its operations than traditional computers are. I expect that the cost of this flexibility in behavior will be sacrificing rigidity at the process level. Performing billions of Bayesian calculations a second isn’t feasible, so a more organic and heuristic based approach will be necessary. If this is correct and such technologies will be necessary for an AGI’s intelligence, then it makes sense that we’d be able to use them for an AGI’s emotions or goals as well.)
Even if we do attempt to build a grand unified Friendly software, I expect little downside (relative to potential risks) to adding these sort of restrictions in addition.
Wow there is a world of assumptions wrapped up in there. For example that the AI has a concept of external agents and an ability to model their internal belief state. That an external agent can have a belief about the world which is wrong. This may sound intuitively obvious, but it’s not a simple thing. This kind of social awareness takes time to be learnt by humans as well. Heinz Wimmer and Josef Perner showed that below a certain age (3-4 years) kids lack an ability to track this information. A teacher puts a toy in a blue cupboard, then leaves the room and you move it to the red cupboard, and the teacher comes back into the room. If you ask the kid not where the toy is, but what cupboard the teacher will look in to find it, and they will say the red cupboard.
It’s no accident that it takes time for this skill to develop. It’s actually quite complex to be able to keep track of and simulate the states of mind of other agents acting in our world. We just take it for granted because we are all well-adjusted adults of a species evolved for social intelligence. But an AI need not think in that way, and indeed of the most interesting use cases for tool AI (“design me a nanofactory constructible with existing tools” or “design a set of experiments organized as a decision tree for accomplishing the SENS research objectives”) would be best accomplished by an idiot savant with no need for social awareness.
I think it goes without saying that obvious AI safety rule #1 is don’t connect an UFAI to the internet. Another obvious rule I think is don’t build in capabilities not required to achieve the things it is tasked with. For the applications of AI I imagine in the pre-singularity timeframe, social intelligence is not a requirement. So when you say “part of its model of the world involves a model of its controllers”, I think that is assuming a capability the AI should not have built-in.
(This is all predicated on soft-enough takeoff that there would be sufficient warning if/when the AI self-developed a social awareness capability.)
What 27chaos said is also worth articulating in my own words. If you want to prevent an intelligent agent from taking a particular category of actions, there are two ways of achieving that requirement: (a) have a filter or goal system which prevents the AI from taking (box) or selecting (goal) actions of that type; or (b) prevent it by design from thinking such thoughts to begin with. An AI won’t take actions it never even considered in the first place. While the latter approach isn’t really possible with unbounded universal inference engines (since “enumerate all possibilities” is usually a step in their construction), such designs arise quite naturally out of more realistic psychology-inspired designs.
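A minimal sketch of the contrast (function names and action labels are made up for illustration): in (a) the dangerous actions exist as generated candidates and get vetoed afterward, while in (b) the action repertoire simply never contains them, so there is nothing to veto.

```python
# (a) generate-then-filter vs. (b) a restricted action repertoire.

FORBIDDEN = {"deceive_operator", "acquire_resources"}

def all_actions():
    # An unbounded-search-style planner enumerates everything,
    # including the dangerous options.
    return ["answer_query", "deceive_operator", "acquire_resources", "idle"]

def filtered_planner():
    # (a) The dangerous actions are considered, then rejected by a filter.
    return [a for a in all_actions() if a not in FORBIDDEN]

def restricted_planner():
    # (b) The repertoire never contained the dangerous actions;
    # the "thought" is never generated in the first place.
    return ["answer_query", "idle"]
```

Both planners output the same safe action set, but only (a) ever had the unsafe candidates in memory, which is the attack surface the filter has to defend.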
The approach to AGI safety that you’re outlining (keep it as a tool AI, don’t give it sophisticated social modeling capability, never give it access to the Internet) is one that I agree should work to keep the AGI safely contained in most cases. But my worry is that this particular approach being safe isn’t actually very useful, because there are going to be immense incentives to give the AGI more general capabilities and have it act more autonomously.
As we wrote in Responses to Catastrophic AGI Risk:
So while I agree that a strict boxing approach would be sufficient to contain the AGI if everyone were to use it, it only works if everyone were indeed to use it, so what we need is an approach that works for more autonomous systems as well.
Hmm. That sounds like a very interesting idea.
While I actually agree that tool AI goals can be programmed, if you want to keep the whole thing from turning unsafely agenty, you’re going to have to strictly separate the inductive reasoning from the actual tool run: run induction for a while, then use tool-mode to compose plans over the induced models of the world, potentially after censoring those models for safety.
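Here is a rough sketch of the separation described above, with deliberately trivial stand-ins for induction and planning (all function names are hypothetical): learning happens in one phase, then tool-mode planning runs as a pure function over a frozen, optionally censored model, producing a plan for human review rather than executing anything.

```python
# Phase 1: induce a world model from observations only; no actions taken.
def run_induction(observations):
    return {"facts": sorted(set(observations))}

# Optional: strip unsafe content from the frozen model before planning.
def censor(model, banned):
    return {"facts": [f for f in model["facts"] if f not in banned]}

# Phase 2: pure tool-mode planning over the frozen model; no further
# learning, no execution, output goes to human review.
def plan(model, goal):
    return [f"use {fact} toward {goal}" for fact in model["facts"]]

model = run_induction(["chemistry", "operator_psychology", "materials"])
safe_model = censor(model, banned={"operator_psychology"})
proposed = plan(safe_model, "design nanofactory")
```

The point of the strict phase boundary is that nothing in `plan` can update the model or act on the world, and the censoring step can remove exactly the kind of operator-modeling knowledge that makes manipulation possible.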
Well, it follows pretty straightforwardly from point 6 (“AIs will want to acquire resources and use them efficiently”) of Omohundro’s The Basic AI Drives, given that the AI would prefer to act in a way conducive to securing human cooperation. We’d probably agree that such goal-camouflage is what an AI would attempt above a certain intelligence threshold. The difference seems to be that you say that threshold is so high as to practically apply only to “hypothetical unbounded universal inference engines”, not “real AIs”. Of course, your “undetectably” requirement does a lot of work in raising the required threshold, though “likely not to be detected in practice” translates to something rather different from, say, “assured undetectability”.
The softer the take-off (plus, the lower the initial starting point in terms of intelligence), the more likely your interpretation would pan out. The harder the take-off (plus, the higher the initial starting point in terms of intelligence), the more likely So8res’ predicted AI behavior would be to occur. Take-off scenarios aren’t mutually exclusive. On the contrary: slow-take-off AI with rather predictable behavior will probably arrive first, which could lull us into a sense of security, leaving us unprepared for a slightly more intelligent cousin that takes off just hard enough, and/or unsupervised enough, to learn to lie to us (and since we’d lump it into the reference class of docile, CogSci-like AI, staying undetected may not be as hard for it as it would have been for the first AI).
Both: the task sure seems hard from a human vantage point, and by definition it will seem easy from a sufficiently intelligent agent’s.
Well this argument I can understand, although Omohundro’s point 6 is tenuous. Boxing setups could prevent the AI from acquiring resources, and non-agents won’t be taking actions in the first place, to acquire resources or otherwise. And as you notice the ‘undetectable’ qualifier is important. Imagine you were locked in a box guarded by a gatekeeper of completely unknown and alien psychology. What procedure would you use for learning the gatekeeper’s motives well enough to manipulate it, all the while escaping detection? It’s not at all obvious to me that with proper operational security the AI would even be able to infer the gatekeeper’s motivational structure enough to deceive, no matter how much time it is given.
MIRI is currently taking actions that only really make sense as priorities in a hard-takeoff future. There are also possible actions which align with a soft-takeoff scenario, or double-dip for both (e.g. Kaj’s proposed research[1]), but MIRI does not seem to be involving itself with this work. This is a shame.
[1] http://intelligence.org/files/ConceptLearning.pdf
There’s no guarantee that boxing will ensure the safety of a soft takeoff. When your boxed AI starts to become drastically smarter than a human (10 times, 1,000 times, 1,000,000 times), the sheer enormity of the mind may slip beyond human ability to understand. All the while, a seemingly small dissonance between the AI’s goals and human values, or a small misunderstanding on our part of what goals we’ve imbued, could magnify to catastrophe as the power differential between humanity and the AI explodes post-transition.
If an AI goes through the intelligence explosion, its goals will be what orchestrates all resources (as Omohundro’s point 6 implies). If the goals of this AI do not align with human values, all we value will be lost.
If you want guarantees, find yourself another universe. “There’s no guarantee” of anything.
Your concept of a boxed AI seems very naive and uninformed. Of course a superintelligence a million times more powerful than a human would probably be beyond the capability of a human operator to manually debug. So what? Actual boxing setups would involve highly specialized machine checkers that assure various properties about the behavior of the intelligence and its runtime, in ways that truly can’t be faked.
And boxing, by the way, means giving the AI zero power. If there is a power differential, then really by definition it is out of the box.
Regarding your last point, it is in fact possible to build an AI that is not a utility maximizer.
No, hairyfigment’s answer was entirely appropriate. Zero power would mean zero effect. Any kind of interaction with the universe means some level of power. Perhaps in the future you should say nearly zero power instead so as to avoid misunderstanding on the parts of others, as taking you literally on the “zero” is apparently “legalistic”.
As to the issues with nearly zero power:
A superintelligence with nearly zero power could turn out to have a heck of a lot more power than you expect.
The incentives to tap more perceived utility by unboxing the AI or building other unboxed AIs will be huge.
Mind, I’m not arguing that there is anything wrong with boxing. What I’m arguing is that it’s wrong to rely only on boxing. I recommend you read some more material on AI boxing and Oracle AI. Don’t miss out on the references.
I have read all of the resources you linked to and their references, the sequences, and just about every post on the subject here on LessWrong. Most of what passes for thinking regarding AI boxing and oracles here is confused and/or fallacious.
It would be helpful if you could point to the specific argument which convinced you of this point. For the most part every argument I’ve seen along these lines either stacks the deck against the human operator(s), or completely ignores practical and reasonable boxing techniques.
Again, I’d love to see a citation. Having a real AGI in a box is basically a ticket to unlimited wealth and power. Why would anybody risk losing control over that by unboxing? Seriously, someone who owns an AGI would be paranoid about keeping their relative advantage, and would spend their time strengthening the box and investing in physical security.
A fact that is only relevant if those properties can capture the desired feature. You’ll recall that defining the desired feature is a major goal of MIRI.
No it doesn’t. Giving the AI zero power to affect our behavior, in the strict sense, would mean not running it (or not letting it produce even one bit of output and not expecting any).
Look, I know the obvious rejoinder doesn’t necessarily tell us that an arbitrary AI’s utility function will attach any value to conquering the world. But the converse part of the theorem does show that world-conquering functions can work. Utility maximization today seems like the best-formalized part of human general intelligence, especially the part that CEOs would like more of. You have not, as far as I’ve seen, shown that any other approach is remotely feasible, much less likely to happen first. (It doesn’t seem like you even want to focus on uploading.) And the parent makes a stronger claim, assuming you want to say that some credible route to AGI will produce different results despite being mathematically equivalent to some utility function.
No, that presumes that what is being checked against is the friendly goal system. What I’m talking about is checking that, e.g., all actions being taken by the AI are in search of solutions to a compact goal description, also extracted from the machine in the form of a Bayesian concept net. Then both the goal set and stochastic samplings of representative mental processes are checked by humans for anomalous behavior (and a much larger subset is frequency-mined to determine what’s representative).
You’re not testing that the machine obeys some as-of-yet-not-figured-out friendly goal set, but that the extracted goals and computational traces are representative, and then manually inspecting those.
That’s a legalistic definition which belongs only in philosophy debates.
I disagree. Much of human behavior is not utility maximizing. Much of it is about fulfilling needs, which is often about eliminating conditions. You have hunger? You eliminate this condition by eating a reasonable amount of food. You do not maximize your lack of hunger by turning the whole planet into a food-generating system and force-feeding the products down your own throat.
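A toy sketch of the difference (the numbers are made up): a need-eliminating satisficer stops when hunger reaches zero, while a “maximize lack of hunger” agent consumes everything it can reach.

```python
# Need-elimination vs. open-ended maximization on the hunger example.

def satisficer(hunger, food_supply):
    # Eat only until the need is eliminated, then stop.
    eaten = 0
    while hunger > 0 and food_supply > 0:
        hunger -= 1
        food_supply -= 1
        eaten += 1
    return eaten, food_supply       # (units eaten, units left in the world)

def maximizer(hunger, food_supply):
    # "Maximize lack of hunger": every available unit of food
    # contributes to the objective, so consume it all.
    return food_supply, 0

sat_eaten, sat_left = satisficer(hunger=3, food_supply=100)
max_eaten, max_left = maximizer(hunger=3, food_supply=100)
```

The satisficer’s behavior is indifferent to how much food exists beyond its need; the maximizer’s behavior scales with available resources, which is exactly the failure mode the convergent-drives argument worries about.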
Anyway, in my own understanding general intelligence has to do with concept formation and system 1/system 2 learned behavior. There’s not much about utility maximization there.
Do you count intelligence augmentation as uploading? Because that’s my path through the singularity.
Gah, no no no. Not every program is equal to a utility maximizer, not if utility and utility maximization are to have any meaning at all. Sure, you can take any program and call it a utility maximizer by finding some super contrived function which is maximized by the program. But if that goal function is more complex than the program that supposedly maximizes it, then all you’ve done is demonstrate the principle of overfitting a curve.
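To illustrate with a deliberately trivial example of my own: for any program you can define a “utility function” that scores 1 exactly on that program’s own input/output behavior, but such a function just memorizes the program’s behavior table, so its description is at least as large as the behavior it supposedly explains.

```python
# Any program can be relabeled a "utility maximizer" by building the
# utility function out of the program's own behavior.

def program(x):
    return x + 1                    # some arbitrary, simple behavior

# Tabulate the program's behavior over a small domain.
trace = {x: program(x) for x in range(5)}

def contrived_utility(x, action):
    # Rewards exactly the recorded trace. Its "definition" is the table
    # itself, so it carries no information beyond the program's behavior.
    return 1 if trace.get(x) == action else 0

# The program trivially "maximizes" this function on every input:
scores = [contrived_utility(x, program(x)) for x in range(5)]
```

The maximizer claim here is vacuous: the utility function encodes the program, rather than the program optimizing some independently specified goal, which is the curve-overfitting point above.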