Whether you have a background in computer science is relevant to ongoing debates at MIRI about “How likely are people to believe X?” That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question, but if one tries to cite your case as an example of what people believe, others shall say, “But Jiro is not a computer scientist! Perhaps computer scientists, as opposed to the general population, are unlikely to believe that.” Of course if you are a computer scientist they will say, “But Jiro is not an elite computer scientist!”, and if you were an elite computer scientist they would say, “Elite computer scientists don’t currently take the issue seriously enough to think about it properly, but this condition will reverse after X happens and causes everyone to take AI more seriously, after which elite computer scientists will get the question right.” But even so, it would be useful data.
That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question
Some off-the-cuff thoughts on why “a superintelligence dumb enough to misinterpret what we mean” may be a contradiction in terms, given the usual meaning of superintelligence:
Intelligence is near-synonymous with “able to build accurate models and to update those models accurately”, with ‘higher intelligence’ denoting a combination of “faster model-building / updating” and/or “less prone to systematic / random errors”.
‘Super’ as a qualifier is usually applied on both dimensions, i.e. “faster and more accurately”. While this seems more like a change in degree (on one hypothesis of intelligence, a devoted immortal fool with an endless supply of paper and pencils could simulate the world), it is often also a change in kind, since in practice there are always resource constraints (unless Multivax reverses entropy), often relevant enough to bar a slower-modeling agent from achieving its goals within the given constraints.
Being “able to build accurate models and to update those models accurately”, then, proportionally increases being “powerful”, i.e. probably able to pursue one’s goals effectively, conditional on those goals being related to the accurate models.
Given a high degree of the former, it is by definition not very hard to acquire and emulate the shared background on which inter-human understanding is built. For an AI, understanding humans would be relevant near-regardless of its actual goals; accurate models of humans are the sine qua non for, e.g., breaking out of the AI box. Being able to build such models quickly and accurately is what classifies the agent as “superintelligent” in the first place! If there were no incentive for the agent to model humans at all, why would there be interactions with humans, such as the human asking the agent to “rescue grandma from the burning building”? The agent, when encountering rocks and precious minerals, will probably seek models reflecting a deep understanding of those. It will do the same when encountering humans.
See, I’m d’accord with statements such as “less intelligent agents would be expected to misinterpret what we mean”, but a superintelligent agent (i.e. an agent good at building accurate models) should by definition be able to understand human-level intentions. If it does not, then in that respect, I wouldn’t call it a superintelligent agent.
In addition, I’d question who would call a domain-limited expert system (great with models on some small subject-spectrum, but evidently abysmal at building models relevant to its goals in other respects) a “superintelligent agent”, with its connotations of general intelligence. Does the expression “a superintelligent chessbot” make sense? Or saying “x is a superintelligent human, except for doing fractions, which he absolutely cannot do”?
Before you label me an idiot who’d expect the AI to fall in love with a human princess on top of the Empire State building, allow me to stress I’m not talking about the goal-specification phase, for which no shared basis for interpretation can be expected. “The humans constructed me to stop cancer. Now, I have come to understand that humans want that in order to live longer, and I use that and all my other refined models of the human psyche to fulfill my goal. Which I do: I stop cancer, by wiping out humanity.” (Refined models cannot be used to change terminal goals, only to choose actions and subgoals to attain those goals.) More qualifications apply:
At first, such human-related models would of course be quite lacking, but they would probably converge fast (by definition). The problem remains of why the superintelligent agent would do what the monkeys intend it to do (never mind what they explicitly told it to do), and how the monkeys could make sure of that in a way which survives self-modification. The intend-it-to / programmed-it-to dichotomy remains a problem then, since terminal goals are presumably not subject to updating/reflection, at least not as part of the ‘superintelligence’ attribute.
tl;dr: A superintelligent agent’s specified goals must be airtightly constructed, but if those include “do what the human intends, not what he says”, then the step from “words” to “intent” should be trivial. (The argument that superintelligent agents will not misinterpret humans does not apply to the goal-setting phase!)
ETA: News at 11: Kawoomba solved FAI. Leverage the foomed AI’s superior model-building ability (which entails that it knows what we want better than we do) by letting it solve the problem: let its initial (invariant) goal be to develop superior models of anything it encounters without affecting it (which should be easier to formalize than “friendliness”), then time it such that it asks for “ENTER NEW GOALS” once it has already established its superior models, at which point you simply tell it “ok glass, use as your new goal system that which I’d most want you to use”.
NEXT!
It’d work great if ‘affecting’ wasn’t secretly a Magical Category based on how you partition physical states into classes that are instrumentally equivalent relative to your end goals.
Point. I’d still expect some variant of “keep (general) interference minimal / do not perturb human activity / build your models using the minimal actions possible” to be easier to formalize than human friendliness, wouldn’t you?
The trouble is that communicating with a human or helping them build the real FAI in any way is going to strongly perturb the world. So actually getting anything useful this way requires solving the problem of which changes to humans, and consequent changes to the world, are allowed to result from your communication-choices.
Except it’s not, as far as the artificial agent is concerned:
Its goals are strictly limited to “develop your models using the minimal actions possible [even ‘just parse the internet, do not use anything beyond wget’ could suffice]; after x number of years have passed, accept new goals from y source.” The new goals could be anything. (It could even be a boat!) (A toy sketch of this two-phase schedule follows below.)
The usefulness regarding FAI becomes evident only at that latter stage, stemming from the foom’ed AI’s models being used to parse the new goals of “do that which I’d want you to do”. It’s sidestepping the big problem (aka “cheating”), but so what?
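Purely as an illustration (my own sketch, not part of the proposal), here is what that two-phase schedule could look like in miniature; every name and number in it is a placeholder, and, as the next comment points out, even this restricted effector still acts on the world:

```python
import time
import urllib.request  # stands in for the "nothing beyond wget" restriction

# Toy sketch of the "learn for x years, then accept new goals from source y" schedule.
# LEARN_SECONDS, SEED_URLS, accept_new_goal, etc. are illustrative placeholders.

LEARN_SECONDS = 5                      # stand-in for "x number of years" (kept tiny for the demo)
SEED_URLS = ["http://example.org/"]    # placeholder corpus

model: dict[str, bytes] = {}

def fetch(url: str) -> bytes:
    """Read-only GET; the proposal assumes this is the agent's only effector."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def learn_phase(deadline: float) -> None:
    """Phase 1: only fetch and refine internal models until the deadline."""
    while time.time() < deadline:
        for url in SEED_URLS:
            model[url] = fetch(url)  # stand-in for "build superior world-models"

def accept_new_goal() -> str:
    """Phase 2: the 'ENTER NEW GOALS' step, to be parsed with the phase-1 models."""
    return input("ENTER NEW GOALS: ")

learn_phase(time.time() + LEARN_SECONDS)
print("received goal:", accept_new_goal())
```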
It’s allowed to emit arbitrary HTTP GETs? You just lost the game.
Ah, you mean because you can invoke e.g. PHP functions with wget / inject SQL code, thus gaining control of other computers etc.?
A sturdier approach to just getting data would be to allow it only to passively listen in on some Tier 1 provider’s backbone (no manipulation of the data flow other than mirroring packets, which is easy to formalize; see the sketch after this comment). Once that goal is formulated, the agent wouldn’t want to circumvent it.
Still seems plenty easier to solve than “friendliness”, as is programming it to ask for new goals after x time. Maintaining invariants under self-modification remains an open task.
It’s not fruitful for me to propose implementations (even though I just did, heh) and for someone else to point out holes (I don’t mean to solve that task in 5 minutes), same as with you proposing full-fledged implementations for friendliness and for someone else to point out holes. Both are non-trivial tasks.
My question is this: given your current interpretation of both approaches (“passively absorb data, ask for new goals after x time” vs. “implement friendliness in the pre-foomed agent outright”), which seems more manageable while still resulting in an FAI?
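For what the read-only constraint could look like at the I/O level, here is a minimal sketch (mine, not from the thread), assuming a mirrored switch port and the third-party scapy library; the process only ever receives frames and never transmits:

```python
# Requires `pip install scapy` and root privileges; "eth0" is a placeholder
# for whatever interface receives the mirrored backbone traffic.
from scapy.all import sniff

def update_world_model(raw: bytes) -> None:
    """Stand-in hook for the model-building component."""
    pass

def observe(packet) -> None:
    # Hand the raw bytes to the modeller; nothing is ever sent back out.
    update_world_model(bytes(packet))

# store=False: don't buffer packets in memory; prn is invoked once per packet.
sniff(iface="eth0", store=False, prn=observe)
```

The formalizable part is exactly the absence of any transmit path; the hard part, as the replies below note, is what the models are later used for.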
A relatively non-scary possibility: The AI destroys itself, because that’s the best way to ensure it doesn’t positively ‘affect’ others in the intuitive sense you mean. (Though that would still of course have effects, so this depends on reproducing in AI our intuitive concept of ‘side-effect’ vs. ‘intended effect’....)
Scarier possibilities, depending on how we implement the goal:
the AI doesn’t kill you and then simulate you; rather, it kills you and then simulates a single temporally locked frame of you, to minimize the possibility that it (or anything) will change you.
the AI just kills everyone, because a large and drastic change now reduces to ~0 the probability that it will cause any larger perturbations later (e.g., when humans might have a big galactic civilization that it would be a lot worse to perturb).
the AI has a model of physics on which all of its actions (eventually) have a roughly equal effect on the atoms that at present compose human beings. So it treats all its possible actions (and inactions) as equivalent, and ignores your restriction in making decisions.
Yes, implementing such a goal is not easy and has pitfalls of its own; however, it’s probably easier than the alternative, since a metric for “no large-scale effects” seems easier to formalize than “human friendliness”, where we have little idea of what that’s even supposed to mean.
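To make “a metric for no large-scale effects” slightly more concrete, here is one toy formalization (my own sketch, not something proposed in the thread, and subject to exactly the objections above): penalize the divergence between the predicted world-state under the agent’s action and under doing nothing.

```latex
% Toy "low-impact" penalty, purely illustrative.
% s_{t+1}: next world-state; a: a candidate action; \varnothing: the null action.
\[
  \mathrm{Impact}(a) \;=\; D_{\mathrm{KL}}\!\big( P(s_{t+1} \mid a) \,\big\|\, P(s_{t+1} \mid \varnothing) \big)
\]
% The agent maximizes modelling progress minus a penalty \lambda on impact:
\[
  a^{*} \;=\; \arg\max_{a}\; \mathbb{E}\big[\text{model accuracy} \mid a\big] \;-\; \lambda\,\mathrm{Impact}(a)
\]
```

How to choose the state abstraction behind those distributions is, of course, where the “Magical Category” problem reappears.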
One usual caveat is reflective consistency: are you OK with creating a faithful representation of humans in these models and then terminating them? If so, how do you know you are not one of those models?
Your mistake here is that you buy into a fairly specific notion of an “AI” onto which you bolt extras.
The outcome pump in the article makes a good example. You have this outcome pump coupled with some advanced fictional 3D scanners that see through walls and such, and then, within this fictional framework, you are coaxed into thinking about how to specify the motion of your mother. Meanwhile, the actual solution is that you do not add those 3D scanners in the first place. You add a button, or better yet a keypad for entering a PIN code, plus a failsafe random source (which serves as a limit on the improbability that the device can cause), and you enter the password when you are satisfied with the outcome. The remaining risk is perhaps a really odd form of stroke that makes you enter the password even though your mother didn’t get saved, or someone ideologically opposed to the outcome pump pointing a gun at your head and demanding you enter the password, that general sort of thing. (A toy simulation of the failsafe idea follows after this comment.)
Likewise, actual software, or even (biological) neural networks, consists of a multitude of components that serve different purposes: creating representations of the real world (which is really about optimizing a model to fit), optimizing over those, and so on. You don’t ever face the problem of how to make the full-blown AI just sit and listen and build a model while having a goal not to wreck stuff. As a necessary part of the full-blown AI, you have the world-modelling component, which you use for that purpose, without it doing any “finding the optimal actions using a model, applying those to the world” in the first place. Likewise, “self-optimization” is not in any way helped by an actual world model or by grounding of concepts like paperclips and similar stuff; you just use the optimization algorithm, which works on mathematical specifications, on a fairly abstract specification of the problem of making a better such optimization algorithm. It’s not in any way like having a full mind do something.
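A toy simulation of the keypad-plus-failsafe idea (all probabilities are made-up, illustrative numbers): because the failsafe random source forces an “accept” with probability FAILSAFE_P per reset regardless of the outcome, the pump can never select for outcomes by more than roughly a factor of 1/FAILSAFE_P, which is the sense in which it limits the improbability the device can cause.

```python
import random

# Toy model of the keypad-plus-failsafe outcome pump described above.
# Each "timeline", the pump samples an outcome and resets unless the operator
# entered the PIN (only done when satisfied) OR the failsafe random source fired.

P_SATISFIED = 1e-4   # prior chance a random timeline satisfies the operator
FAILSAFE_P = 1e-2    # chance per timeline that the failsafe forces "accept"

def run_pump(rng: random.Random) -> bool:
    """Re-roll timelines until one is accepted; return whether the operator was satisfied."""
    while True:
        satisfied = rng.random() < P_SATISFIED  # operator happy, enters the PIN
        failsafe = rng.random() < FAILSAFE_P    # random source fires regardless
        if satisfied or failsafe:
            return satisfied

rng = random.Random(0)
runs = 50_000
frac = sum(run_pump(rng) for _ in range(runs)) / runs
print(f"P(satisfied | accepted) ~= {frac:.4f}")  # analytically ~= 0.0099
# Selection pressure is bounded: amplification <= 1 / FAILSAFE_P = 100x.
print(f"amplification ~= {frac / P_SATISFIED:.0f}x (bound: {1 / FAILSAFE_P:.0f}x)")
```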
If you already know what you’re going to tell it when it asks for new goals, couldn’t you just program that in from the beginning? So the script would be, “work on your models for X years, then try to parse this statement …”
Also, re: Eliezer’s HTTP GET objection, you could just give it a giant archive of the internet and no actual connection to the outside world. If it’s just supposed to be learning and not affecting anything external, that should be sufficient (to ensure learning, not necessarily to preclude all effects on the outside world).
At this point, I think we’ve just reinvented the concept of CEV.
That no superintelligence could be dumb enough to misinterpret what we mean is the particular belief in question
I didn’t come up with that myself; I got it from MugaSofer: ‘Well, since the whole genie thing is a metaphor for superintelligence, “this genie is trying to be Friendly but it’s too dumb to model you well” doesn’t really come up.’
Under reasonable definitions of “superintelligence” it does follow that a superintelligence must know what you mean, but if you pick some other definition and state so outright, I won’t argue with it. (It is, however, still subject to “talk to the intelligence to figure out what it’s going to do”.)
Of course if you are a computer scientist they will say, “But Jiro is not an elite computer scientist!”, and if you were an elite computer scientist they would say, “Elite computer scientists don’t currently take the issue seriously enough to think about it properly...
I think you’re making my case for me.
PS: If you want to reply please post a new reply to the root message since I can’t afford the karma hits to respond to you.