The Blue-Minimizing Robot
Imagine a robot with a turret-mounted camera and laser. Each moment, it is programmed to move forward a certain distance and perform a sweep with its camera. As it sweeps, the robot continuously analyzes the average RGB value of the pixels in the camera image; if the blue component passes a certain threshold, the robot stops, fires its laser at the part of the world corresponding to the blue area in the camera image, and then continues on its way.
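The program as described is simple enough to sketch directly. A minimal illustrative version, with the threshold value and function names invented for the example:

```python
# Illustrative sketch of the robot's program as described above.
# The threshold and interface names are invented for this example.

BLUE_THRESHOLD = 128  # hypothetical trigger level for the blue channel

def average_rgb(pixels):
    """Average (R, G, B) over a list of (r, g, b) pixel tuples."""
    n = len(pixels)
    r = sum(p[0] for p in pixels) / n
    g = sum(p[1] for p in pixels) / n
    b = sum(p[2] for p in pixels) / n
    return (r, g, b)

def robot_step(camera_image, fire_laser, move_forward):
    """One moment of the loop: move forward, sweep, maybe shoot.

    camera_image maps a region of the sweep to its pixels; fire_laser
    aims at the part of the world corresponding to that region.
    """
    move_forward()
    for region, pixels in camera_image.items():
        _, _, blue = average_rgb(pixels)
        if blue > BLUE_THRESHOLD:
            fire_laser(region)
```

Note that nothing in this loop mentions blue *objects*, only blue *pixel values*; everything that follows turns on that distinction.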
Watching the robot’s behavior, we would conclude that this is a robot that destroys blue objects. Maybe it is a surgical robot that destroys cancer cells marked by a blue dye; maybe it was built by the Department of Homeland Security to fight a group of terrorists who wear blue uniforms. Whatever. The point is that we would analyze this robot in terms of its goals, and in those terms we would be tempted to call this robot a blue-minimizer: a machine that exists solely to reduce the amount of blue objects in the world.
Suppose the robot had human level intelligence in some side module, but no access to its own source code; that it could learn about itself only through observing its own actions. The robot might come to the same conclusions we did: that it is a blue-minimizer, set upon a holy quest to rid the world of the scourge of blue objects.
But now stick the robot in a room with a hologram projector. The hologram projector (which is itself gray) projects a hologram of a blue object five meters in front of it. The robot’s camera detects the projector, but its RGB value is harmless and the robot does not fire. Then the robot’s camera detects the blue hologram and zaps it. We arrange for the robot to enter this room several times, and each time it ignores the projector and zaps the hologram, without effect.
Here the robot is failing at its goal of being a blue-minimizer. The right way to reduce the amount of blue in the universe is to destroy the projector; instead its beams flit harmlessly through the hologram.
Again, give the robot human level intelligence. Teach it exactly what a hologram projector is and how it works. Now what happens? Exactly the same thing—the robot executes its code, which says to scan the room until its camera registers blue, then shoot its laser.
In fact, there are many ways to subvert this robot. What if we put a lens over its camera which inverts the image, so that white appears as black, red as cyan, blue as yellow, and so on? The robot will not shoot us with its laser to prevent such a violation (unless we happen to be wearing blue clothes when we approach)—its entire program was detailed in the first paragraph, and there’s nothing about resisting lens alterations. Nor will the robot correct itself and shoot only at objects that appear yellow—its entire program was detailed in the first paragraph, and there’s nothing about correcting its program for new lenses. The robot will continue to zap objects that register a blue RGB value; but now it’ll be shooting at anything that is yellow.
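Under standard RGB channel inversion this failure mode is easy to demonstrate. A toy filter and trigger (names and threshold are hypothetical stand-ins for the robot's program):

```python
# Sketch of the inversion-lens failure mode. The lens and the trigger
# rule below are hypothetical stand-ins for the robot described above.

BLUE_THRESHOLD = 128

def invert(pixel):
    """The lens inverts each channel, so blue and yellow swap places."""
    r, g, b = pixel
    return (255 - r, 255 - g, 255 - b)

def triggers(pixel):
    """The unchanged program: fire whenever the blue channel is high."""
    return pixel[2] > BLUE_THRESHOLD

yellow = (255, 255, 0)
blue = (0, 0, 255)

# Through the lens, yellow objects register as blue and get zapped,
# while genuinely blue objects no longer trigger the laser at all.
assert triggers(invert(yellow))
assert not triggers(invert(blue))
```

The program never changed; only the mapping between the world and its pixel values did.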
The human-level intelligence version of the robot will notice its vision has been inverted. It will know it is shooting yellow objects. It will know it is failing at its original goal of blue-minimization. And maybe if it had previously decided it was on a holy quest to rid the world of blue, it will be deeply horrified and ashamed of its actions. It will wonder why it has suddenly started to deviate from this quest, and why it just can’t work up the will to destroy blue objects anymore.
The robot goes to Quirinus Quirrell, who explains that robots don’t really care about minimizing the color blue. They only care about status and power, and pretend to care about minimizing blue in order to impress potential allies.
The robot goes to Robin Hanson, who explains that there are really multiple agents within the robot. One of them wants to minimize the color blue, the other wants to minimize the color yellow. Maybe the two of them can make peace, and agree to minimize yellow one day and blue the next?
The robot goes to Anna Salamon, who explains that robots are not automatically strategic, and that if it wants to achieve its goal it will have to learn special techniques to keep focus on it.
I think all of these explanations hold part of the puzzle, but that the most fundamental explanation is that the mistake began as soon as we started calling it a “blue-minimizing robot”. This is not because its utility function doesn’t exactly correspond to blue-minimization: even if we try to assign it a ponderous function like “minimize the color represented as blue within your current visual system, except in the case of holograms” it will be a case of overfitting a curve. The robot is not maximizing or minimizing anything. It does exactly what it says in its program: find something that appears blue and shoot it with a laser. If its human handlers (or itself) want to interpret that as goal directed behavior, well, that’s their problem.
It may be that the robot was created to achieve a specific goal. It may be that the Department of Homeland Security programmed it to attack blue-uniformed terrorists who had no access to hologram projectors or inversion lenses. But to assign the goal of “blue minimization” to the robot is a confusion of levels: this was a goal of the Department of Homeland Security, which became a lost purpose as soon as it was represented in the form of code.
The robot is a behavior-executor, not a utility-maximizer.
In the rest of this sequence, I want to expand upon this idea. I’ll start by discussing some of the foundations of behaviorism, one of the earliest theories to treat people as behavior-executors. I’ll go into some of the implications for the “easy problem” of consciousness and philosophy of mind. I’ll very briefly discuss the philosophical debate around eliminativism and a few eliminativist schools. Then I’ll go into why we feel like we have goals and preferences and what to do about them.
The robot is not consequentialist: its decisions are not controlled by the dependence of facts about the future on its decisions.
Good point, but the fact that humans are consequentialists (at least partly) doesn’t seem to make the problem much easier. Suppose we replace Yvain’s blue-minimizer robot with a simple consequentialist robot that has the same behavior (let’s say it models the world as a 2D grid of cells that have intrinsic color, it always predicts that any blue cell that it shoots at will turn some other color, and its utility function assigns negative utility to the existence of blue cells). What does this robot “actually want”, given that the world is not really a 2D grid of cells that have intrinsic color?
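One way to make this comment's toy robot concrete (all names, and the rule that shooting recolors a cell gray, are invented for illustration):

```python
# A toy version of the consequentialist robot described above: it
# models the world as a 2D grid of colored cells, predicts that
# shooting a blue cell recolors it, and picks the action its own
# model scores best. The recoloring rule and names are invented.

def utility(grid):
    """Negative utility for each cell the model says is blue."""
    return -sum(row.count("blue") for row in grid)

def predict(grid, action):
    """The robot's (possibly wrong) model: shooting turns blue to gray."""
    if action is None:
        return grid
    r, c = action
    new = [row[:] for row in grid]
    if new[r][c] == "blue":
        new[r][c] = "gray"
    return new

def choose_action(grid):
    """Pick the action with highest predicted utility (None = do nothing)."""
    actions = [None] + [(r, c) for r in range(len(grid))
                        for c in range(len(grid[0]))]
    return max(actions, key=lambda a: utility(predict(grid, a)))
```

The question in the comment is what such a robot "actually wants" when the real world is not a grid and `predict` is systematically wrong, as with the hologram.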
Who cares about the question what the robot “actually wants”? Certainly not the robot. Humans care about the question what they “actually want”, but that’s because they have additional structure that this robot lacks. But with humans, you’re not limited to just looking at what they do on auto-pilot; instead, you can just ask the aforementioned structure when you run into problems like this. For example, if you asked me what I really wanted under some weird ontology change, I could say, “I have some guesses, but I don’t really know; I would like to defer to a smarter version of me”. That’s how I understand preference extrapolation: not as something that looks at what your behavior suggests that you’re trying to do and then does it better, but as something that poses the question of what you want to some system you’d like to answer the question for you.
It looks to me like there’s a mistaken tendency among many people here, including some very smart people, to say that I’d be irrational to let my stated preferences deviate from my revealed preferences; that just because I seem to be trying to do something (in some sense like: when my behavior isn’t being controlled much by the output of moral philosophy, I can be modeled as a relatively good fit to a robot with some particular utility function), that’s a reason for me to do it even if I decide that I don’t want to. But rational utility maximizers get to be indifferent to whatever the heck they want, including their own preferences, so it’s hard for me to see why the underdeterminedness of the true preferences of robots like this should bother me at all.
Insert standard low confidence about me posting claims on complicated topics that others seem to disagree with.
In other words, our “actual values” come from our being philosophers, not our being consequentialists.
It seems plausible to me, and I’m not sure that “many” others do disagree with you.
That would imply a great diversity of value systems, because philosophical intuitions differ much more from person to person than primitive desires. Some of these value systems (maybe including yours) would be simple, some wouldn’t. For example, my “philosophical” values seem to give large weight to my “primitive” values.
That might be a procedure that generates human preference, but it is not a general preference extrapolation procedure. E.g. suppose we replace Wei Dai’s simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, “What system do you want to answer the question of what you want for you?” with the answer, “A version of myself better able to answer that question. Maybe it should be smarter and know more things and be nicer to strangers and not have scope insensitivity and be less prone to skipping over invisible moral frameworks and have concepts that are better defined over attribute space and be automatically strategic and super committed and stuff like that? But since I’m not that smart and I pass over moral frameworks and stuff, everything I just said is probably insufficient to specify the right thing. Maybe you can look at my source code and figure out what I mean by right and then do the thing that a person who better understood that would do?” And then goes right back to zapping blue.
Suppose we replace Wei Dai’s simple consequentialist robot with a robot that has similar behavior, but that also responds to the question, “What system do you want to answer the question of what you want for you?” with the answer, “I want to decide for myself” and responds to the question, “What do you want to do?” with the answer, “I want to make babies happy. Oh and help grandmother out of the burning building. Oh, and without killing her. Oh, and to preserve complex novelty. Oh and boredom. Oh, and there should still be people in the world who are trying to improve it. Oh and...damnit, this is complicated. Okay, never mind, I want you to ask the version of myself who I presently think is smart enough to answer this question and who knows what the right thing to do is even better than me.”
It can answer those two questions, but if you ask it to clarify the last response, it just blows up.
Actually, this notion of consequentialism gives a new clue, and the only one I know of, about how to infer agent goals, or how to constrain the kinds of considerations that should be considered goals, as compared to the other stuff that moves your action incidentally, such as psychological drives or laws of physics. I wonder if Eliezer had this insight before, given that he wrote a similar comment to this thread. I wasn’t ready to see this idea on my own until a few weeks ago, and this thread is the first time I thought about the question given the new framework, and saw the now-obvious construction. This deserves more than a comment, so I’ll be working on a two-post sequence to write this up intelligibly. Or maybe it’s actually just stupid; I’ll try to figure that out.
(A summary from my notes, in case I get run over by a bus; this uses a notion of “dependence” for which a toy example is described in my post on ADT, but which is much more general: )
The idea of consequentialism, of goal-directed control, can be modeled as follows. If a fact A is controlled by (can be explained/predicted based on) a dependence F: A->O, then we say that A is a decision (action) driven by a consequentialist consideration F, which in turn looks at how A controls the morally relevant fact O.
For a given decision A, there could be many different morally relevant facts O such that the dependence A->O has explanatory power about A. The more a dependence A->O can explain about A, the more morally relevant O is. Finding highly relevant facts O essentially captures A’s goals.
This model has two good properties. First, logical omniscience (in particular, just knowledge of actual action) renders the construction unusable, since we need to see dependencies A->O as ambient concepts explaining A, so both A and A->O need to remain potentially unknown. (This is the confusing part. It also lends motivation to the study of complete collection of moral arguments and the nature of agent-provable collection of moral arguments.)
Second, action (decision) itself, and many other facts that control the action but aren’t morally relevant, are distinguished by this model from the things that are. For example, A can’t be morally relevant, for that would require the trivial identity dependence A->A to explain A, which it can’t, since it’s too simple. Similarly for other stuff in simple relationship with A: the relationship between A and a fact must be in tune with A for the fact to be morally relevant, it’s not enough for the fact itself to be in tune with A.
This question doesn’t require a fixed definition for a goal concept, instead it shows how various concepts can be regarded as goals, and how their suitability for this purpose can be compared. The search for better morally relevant facts is left open-ended.
I very much look forward to your short sequence on this. I hope you will also explain your notion of dependence in detail.
For the record, I mostly completed a draft of a prerequisite post (first of the two I had in mind) a couple of weeks ago, and it’s just no good, not much better than what one would take away from reading the previously published posts, and not particularly helpful in clarifying the intuition expressed in the above comments. So I’m focusing on improving my math skills, which I expect will help with formalization/communication problem (given a few months), as well as with moving forward. I might post some version of the post, but it seems it won’t be able to serve the previously intended purpose.
Bummer.
As for communication, it would help me (at least) if you used words in their normal senses unless they are a standard LW term of art (e.g. ‘rationalist’ means LW rationalist not Cartesian rationalist ) or unless you specify that you’re using the term in an uncommon sense.
Don’t see how this is related to this thread, and correspondingly what kinds of word misuse you have in mind.
It isn’t related to this thread. I was thinking of past confusions between us over ‘metaethics’ and ‘motivation’ and ‘meaning’ where I didn’t realize until pretty far into the discussion that you were using these terms to mean something different than they normally mean. I’d generally like to avoid that kind of thing; that’s all I meant.
Well, I’m mostly not interested in the concepts corresponding to how these words are “normally” used. The miscommunication problems resulted from both me misinterpreting your usage, and your misinterpreting my usage. I won’t be misinterpreting your usage in similar cases in the future, as I now know better what you mean by which words, and in my own usage, as we discussed a couple of times, I’ll be more clear through using disambiguating qualifiers, which in most cases amounts to writing word “normative” more frequently.
(Still unclear/strange why you brought it up in this particular context, but no matter...)
Yup, that sounds good.
I brought it up because you mentioned communication, and your comment showed up in my LW inbox today.
The robot wants to minimize the amount of blue it sees in its grid representation of the world. It can do this by affecting the world with a laser. But it could also change its camera system so that it sees less blue. If there is no term in the utility function that says that the grid has to model reality, then both approaches are equally valid.
steven0461’s comment notwithstanding, I can take a guess at what the robot actually wants. I think it wants to take the action that will minimize the number of blue cells existing in the world, according to the robot’s current model of the world. That rule for choosing actions probably doesn’t correspond to any coherent utility function over the real world, but that’s not really a surprise.
The interesting question that you probably meant to ask is whether the robot’s utility function over its model of the world can be converted to a utility function over the real world. But the robot won’t agree to any such upgrade, so the question is kinda moot.
That might sound hopeless for CEV, but fortunately humans aren’t consequentialists with a fixed model of the world. Instead they seem to be motivated by pleasure and pain, which you can’t disprove out of existence by coming up with a better model. So maybe there’s hope in that direction.
To avoid SEEING blue things. If the model is good enough, it’d search out a mirror and laser its own camera so that it could NEVER see a blue pixel again.
This can be modelled using human empathy by equating the sensation of seeing blue with pain. You don’t care to minimize damage to your body (if it’s not something that actually cripples you), but you care about not getting the signal about it happening, and your reaction to a pill that turned you masochist would be very different than your reaction to a murder pill.
Edit: huh? I am surprised that this is downvoted, and the most probable reason is that I’m wrong in some obvious way that I can’t see; can someone please tell me how? (Or maybe my usage of empathy was interpreted way too literally.)
(Reading this comment first might be helpful.)
To answer your thought experiment. It doesn’t matter what the agent thinks it’s acting based on, we look at it from the outside instead (but using a particular definition/dependence that specifies the agent), and ask how its action depends on the dependence of the actual future on its actual action. Agent’s misconceptions don’t enter this question. If the misconceptions are great, it’ll turn out that the dependence of actual future on agent’s action doesn’t control its action, or controls it in some unexpected way. Alternatively, we could say that it’s not the actual future that is morally relevant for the agent, but some other strange fact, in which case the agent could be said to be optimizing a world that is not ours. From yet another perspective, the role of the action could be played by something else, but then it’s not clear why we are considering such a model and talking about this particular actual agent at the same time.
Is that something you can see from the outside? If I argmax over actions in expected-paper-clips or over updateless-prior-expected-paper-clips, how can you translate my black box behavior over possible worlds into the dependence of my behavior on the dependence of the worlds on my behavior?
See the section “Utility functions” of this post: it shows how a dependence between two fixed facts could be restored in an ideal case where we can learn everything there is to learn about it. Similarly, you could consider the fact of which dependence holds between two facts, with various specific functions as its possible values, and ask what can you infer about that other fact if you assume that the dependence is given by a certain function.
More generally, a dependence follows possible inferences, things that could be inferred about one fact if you learn new things about the other fact. It needs to follow all such inferences, to the best of the agent’s ability; otherwise it won’t be right and you’ll get incorrect decisions (counterfactual models).
Edit: Actually, never mind, I missed your point. Will reply again later. (Done.)
Our world is not really a 2D grid, its world could be. It won’t be a consequentialist about our world then, for that requires the dependence of its decisions on the dependence of our world on its decisions. It looks like the robot you described wants to minimize the 2D-grid blueness, and would do that sensibly in the context of the 2D grid world, or any world that can influence the 2D grid world. For the robot, our world doesn’t exist in the same sense as the 2D grid world doesn’t exist for us, even though we could build an instance of the robot in our world. If we do build such an instance, the robot, if extremely rational and not just acting on heuristics adapted for its natural habitat, could devote its our-worldly existence to finding ways of acausally controlling its 2D world. For example, if there are some 2D-worlders out there simulating our world, it could signal to them something that is expected to reduce blueness.
(This all depends on the details of robot’s decision-making tools, of course. It could really be talking about our world, but then its values collapse, and it could turn out to not be a consequentialist after all, or optimizing something very different, to the extent the conflict in the definitions is strong.)
The conclusion I’d draw from this essay is that one can’t necessarily derive a “goal” or a “utility function” from all possible behavior patterns. If you ask “What is the robot’s goal?”, the answer is, “it doesn’t have one,” because it doesn’t assign a total preference ordering to states of the world. At best, you could say that it prefers state [I SEE BLUE AND I SHOOT] to state [I SEE BLUE AND I DON’T SHOOT]. But that’s all.
This has some implications for AI, I think. First of all, not every computer program has a goal or a utility function. There is no danger that your TurboTax software will take over the world and destroy all human life, because it doesn’t have a general goal to maximize the number of completed tax forms. Even rather sophisticated algorithms can completely lack goals of this kind—they aren’t designed to maximize some variable over all possible states of the universe. It seems that the narrative of unfriendly AI is only a risk if an AI were to have a true goal function, and many useful advances in artificial intelligence (defined in the broad sense) carry no risk of this kind.
Do humans have goals? I don’t know; it’s plausible that we have goals that are complex and hard to define succinctly, and it’s also plausible that we don’t have goals at all, just sets of instructions like “SHOOT AT BLUE.” The test would seem to be if a human goal of “PROMOTE VALUE X” continues to imply behaviors in strange and unfamiliar circumstances, or if we only have rules of behavior in a few common situations. If you can think clearly about ethics (or preferences) in the far future, or the distant past, or regarding unfamiliar kinds of beings, and your opinions have some consistency, then maybe those ethical beliefs or preferences are goals. But probably many kinds of human behavior are more like sets of instructions than goals.
No; placing a blue-tinted mirror in front of it will cause it to shoot itself, even though that greatly diminishes its future ability to shoot. Generally, a generic program really can’t be assigned any nontrivial utility function.
Destroying the robot greatly diminishes its future ability to shoot, but it would also greatly diminish its future ability to see blue. The robot doesn’t prefer ‘shooting blue’ to ‘not shooting blue’; it prefers ‘seeing blue and shooting’ to ‘seeing blue and not shooting’.
So the original poster was right.
Edit: I’m wrong, see below
If the robot knows that its camera is indestructible but its gun isn’t, it would still shoot at the mirror and destroy only its gun.
So it would be [I SEE BLUE AND I TRY TO SHOOT].
… except that it wouldn’t mind if shooting itself damaged its own program so that it wouldn’t even try to shoot if it saw blue anymore.
Ok, I am inclined to agree that its behaviour can’t be described in terms of goals.
This is a very awesome post. Thumbs up.
What does it mean for a program to have intelligence if it does not have a goal? (or have components that have goals)
The point of any incremental intelligence increase is to let the program make more choices, and perhaps choices at higher levels of abstraction. Even at low intelligence levels, the AI will only ‘do a good job’ if the basis of those choices adequately matches the basis we would use to make the same choice. (a close match at some level of abstraction below the choice, not the substrate and not basic algorithms)
Creating ‘goal-less’ AI still has the machine making more choices for more complex reasons, and allows for non-obvious mismatches between what it does and what we intended it to do.
Yes, you can look at paperclip-manufacturing software and see that it is not a paper-clipper, but some component might still be optimizing for something else entirely. We can reject the anthropomorphically obvious goal and there can still be a powerful optimization process that affects the total system, at the expense of both human values and produced paperclips.
Consider automatic document translation. Making the translator more complex and more accurate doesn’t imbue it with goals. It might easily be the case that in a few years, we achieve near-human accuracy at automatic document translation without major breakthroughs in any other area of AI research.
Making it more accurate is not the same as making it more intelligent. The question is: How does making something “more intelligent” change the nature of the inaccuracies? In translation especially there can be a bias without any real inaccuracy.
Goallessness at the level of the program is not what makes translators safe. They are safe because neither they nor any component is intelligent.
Most professional computer scientists and programmers I know routinely talk about “smart”, “dumb”, “intelligent” etc algorithms. In context, a smarter algorithm exploits more properties of the input or the problem. I think this is a reasonable use of language, and it’s the one I had in mind.
(I am open to using some other definition of algorithmic intelligence, if you care to supply one.)
I don’t see why making an algorithm smarter or more general would make it dangerous, so long as it stays fundamentally a (non-self-modifying) translation algorithm. There certainly will be biases in a smart algorithm. But dumb algorithms and humans have biases too.
I generally go with cross domain optimization power. http://wiki.lesswrong.com/wiki/Optimization_process Note that optimization target is not the same thing as a goal, and the process doesn’t need to exist within obvious boundaries. Evolution is goalless and disembodied.
If an algorithm is smart because a programmer has encoded everything that needs to be known to solve a problem, great. That probably reduces potential for error, especially in well-defined environments. This is not what’s going on in translation programs, or even the voting system here. (based on reddit) As systems like this creep up in complexity, their errors and biases become more subtle. (especially since we ‘fix’ them so that they usually work well) If an algorithm happens to be powerful in multiple domains, then the errors themselves might be optimized for something entirely different, and perhaps unrecognizable.
By your definition I would tend to agree that they are not dangerous, so long as their generalized capabilities are below human level (this seems to be the case for everything so far), with some complex caveats. For example, ‘non-self-modifying’ likely provides a false sense of security. If an AI has access to a medium which can be used to do computations, and the AI is good at making algorithms, then it could build a powerful, if not superintelligent, program.
Also, my concern in this thread has never been about the translation algorithm, the tax program, or even the paperclipper. It’s about some sub-process which happens to be a powerful optimizer (in a hypothetical situation where we do more AI research on the premise that it is safe if it is in a goalless program).
This is a very interesting question, thanks for making me think about it.
(Based on your other comments elsewhere in this thread), it seems like you and I are in agreement that intelligence is about having the capability to make better choices. That is, given two agents with an identical problem and identical resources to work with, the more intelligent agent is more likely to make the “better” choice.
What does “better” mean here? We need to define some sort of goal and then compare the outcomes of their choices and how closely those outcomes match those goals. I have a couple of disorganized thoughts here:
The goal is just necessary for us, outsiders, to compare the intelligence of the two agents. The goal is not necessary for the existence of intelligence in the agents if no one’s interested in measuring their intelligence.
Assuming the agents are cooperative, you can temporarily assign subgoals. For example, perhaps you and I would like to know which one of us is smarter. You and I might have many different goals, but we might agree to temporarily take on a similar goal (e.g. win this game of chess, or get the highest amount of correct answers on this IQ test, etc.) so that our intelligence can be compared.
The “assigning” of goals to an intelligence strongly implies to me that goals are orthogonal to intelligence. Intelligence is the capability to fulfil any general goal, and it’s possible for someone to be intelligent even if they do not (currently, or ever) have any goals. If we come up with a new trait called Sodadrinkability which is the capability to drink a given soda, one can say that I possess Sodadrinkability—that I am capable of drinking a wide range of possible sodas provided to me—even if I do not currently (or ever) have any sodas to drink.
Let me suggest that the difference between goal-less behavior and goal-driven behavior is that goal-driven behavior seeks means to attain its end. The means will vary with circumstances, while the end remains relatively invariant. Another indication of goal-driven behavior is that means are often prepared in anticipation of need, rather than in response to present need.
I said “relatively invariant” because goals can be and often are hierarchical. An example was outlined by Maslow in his “A Theory of Human Motivation” in the Psychological Review (1943). Maslow aside, in problem solving we often resort to staged solutions in which the means to a higher-order goal become a new sub-goal, and so on iteratively, until we reach low-level goals within our immediate grasp.
A second point is that terms such as “purposeful” and “goal-seeking” are analogously predicated. A term is analogously predicated when it is applied to different cases with a meaning that is partly the same and partly different. Thus, a goal-seeking robot is not goal-seeking because it intends any goals of its own, but because it is the vehicle by which the designer seeks to effect his or her goals. In the parable, if the goal was the destruction of blue-uniformed enemies, that goal was only intended by the robot’s creators. Since the robot is an instantiated means of attaining that goal, we may speak, analogously, of it as having the same goal. The important point is that we mean different things in saying “the designer has a goal” and “the robot has a goal.” Each works toward the same end (so the meaning is partly the same), but only the designer intends that end (so the meaning is partly different). (BTW, this kind of analogy is an “analogy of attribution.”)
The fact that the robot is ineffective in attaining its end is a side issue that might be solved by employing better algorithms (edge and pattern recognition, etc.). There is no evidence that better algorithms will give the robot intentions in the sense that the designer has intentions.
Also, you misspelled my name—it’s Quirinus, not Quirinius.
Hijacking top comment… To finish reading Yvain’s sequence, check out the corresponding sequence page.
I’ll be interested to see where you go with this, but it seems to me that saying, “look, this is the program the robot runs, therefore it doesn’t really have a goal”, is exactly like saying “look, it’s made of atoms, therefore it doesn’t really have a goal”.
Goals are precisely explained (like rainbows), and not explained away (like kobolds), as the controlled variables of control systems. This robot is such a system. The hypothetical goal of its designers at the Department of Homeland Security is also a goal. That does not make the robot’s goal not a goal; it just makes it a different goal.
We feel like we have goals and preferences because we do, in fact, have goals and preferences, and we not only have them, but we are also aware of having them. The robot is not aware of having the goal that it has. It merely has it.
First of all, your control theory work was...not exactly what started me thinking along these lines, but what made it click when I realized the lines I had been thinking along were similar to the ones I had read about in one of your introductory posts about performing complex behaviors without representations. So thank you.
Second—When you say the robot has a “different goal”, I’m not sure what you mean. What is the robot’s goal? To follow the program detailed in the first paragraph?
Let’s say Robot-1 genuinely has the goal to kill terrorists. If a hacker were to try to change its programming to “make automobiles” instead, Robot-1 would do anything it could to thwart the hacker; its goal is to kill terrorists, and letting a hacker change its goal would mean more terrorists get left alive. This sort of stability, in which the preference remains a preference regardless of context, is characteristic of my definition of “goal”.
This “blue-minimizing robot” won’t display that kind of behavior. It doesn’t thwart the person who places a color inversion lens on it (even though that thwarts its stated goal of “minimizing blue”), and it wouldn’t try to take the color inversion lens off even if it had a manipulator arm. Even if you claim its goal is just to “follow its program”, it wouldn’t use its laser to stop someone walking up to it and changing its program, which means its program no longer got followed.
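The divergence between “follows its program” and “pursues the goal of minimizing blue” is easy to make concrete. Here is a toy sketch (the function names, RGB encoding, and threshold are my own illustrative assumptions, not anything from the post): the program fires at whatever *appears* blue, so an inversion lens silently redirects it toward yellow objects, and nothing in the program resists that.

```python
def perceived(color, lens_inverted):
    """RGB color as seen through the camera; an inversion lens flips every channel."""
    r, g, b = color
    return (255 - r, 255 - g, 255 - b) if lens_inverted else color

def robot_fires_at(color, lens_inverted, blue_threshold=128):
    """The entire 'goal': fire iff the perceived blue channel passes a threshold."""
    _, _, b = perceived(color, lens_inverted)
    return b > blue_threshold

BLUE, YELLOW = (0, 0, 255), (255, 255, 0)
```

With the lens on, `robot_fires_at(BLUE, True)` is false and `robot_fires_at(YELLOW, True)` is true: the program runs exactly as before, while the behavior an observer would describe as “minimizing blue” has quietly inverted.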
This isn’t just a reduction of a goal to a program: predicting the robot’s goal-based behavior and its program-based behavior give different results.
If goals reduce to a program like the robot’s in any way, it’s in the way that Einsteinian mechanics “reduce” to Newtonian mechanics—giving good results in most cases but being fundamentally different and making different predictions on border cases. Because there are other programs that goals do reduce to, like the previously mentioned Robot-1, I don’t think it’s appropriate to call what the blue-minimizer is doing a “goal”.
If you still disagree, can you say exactly what goal you think the robot is pursuing, so I can examine your argument in more detail?
The robot’s goal is not to follow its own program. The program is simply what the robot does. In the environment it is designed to operate in, what it does is destroy blue objects. In the vocabulary of control theory, the controlled variable is the number of blue objects, the reference value is zero, the difference between the two is the error, firing the laser is the action it takes when the error is positive, and the action has the effect of reducing the error. The goal, as with any control system, is to keep the error at zero. It does not have an additional goal of being the best destroyer of blue objects possible. Its designers might have that goal, but if so, that goal is in the designers, not in the system they have designed.
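That vocabulary can be sketched in a few lines of code (the class and names below are my own illustration, not anything from the comment): the controlled variable is the count of blue objects, the reference is zero, and firing the laser is the action that normally drives the error back toward zero. Against an invulnerable (holographic) object, the error simply persists.

```python
class BlueObject:
    def __init__(self, destructible=True):
        self.destructible = destructible  # a hologram would be indestructible
        self.destroyed = False

def control_step(world, reference=0):
    """One perceive-compare-act cycle of the control loop."""
    blue = [o for o in world if not o.destroyed]
    perception = len(blue)           # controlled variable
    error = perception - reference   # error signal
    if error > 0:
        target = blue[0]
        if target.destructible:      # firing normally reduces the error
            target.destroyed = True
    return error
```

Run against two ordinary blue objects, the error falls 2, 1, 0; run against a hologram, it stays stuck at 1 forever: a control system encountering a disturbance it is unable to control, not a non-control-system.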
In an environment containing blue objects invulnerable to laser fire, the robot will fail to control the number of blue objects at zero. That does not make it not a control system, just a control system encountering disturbances it is unable to control. To ask whether it is still a control system veers into a purely verbal argument, like asking whether a table is still a table if one leg has broken off and it cannot stand upright.
People are more complex. They have (according to PCT) a large hierarchy of control systems (very broad, but less than a dozen levels deep), in which the reference signal for each controller is set by the output signals of higher level controllers. (At the top, reference signals are presumably hard-wired, and at the bottom, output signals go to organs not made of neurons—muscles, mainly.) In addition, the hierarchy is subject to reorganisation and other forms of adaptation. The adaptations present to consciousness are the ability to think about our goals, consider whether we are doing the best things to achieve them, and change what we are doing. The robot in the example cannot do this.
You might be thinking of “goal” as meaning this sort of conscious, reflective, adaptive attempt to achieve what we “really” want, but I find that too large and fuzzy a concept. It leads into a morass of talk about our “real” goals vs. the goals we think we have, self-reflexive decision theory, extreme thought experiments, and so on. A real science of living things has to start smaller, with theories and observations that can be demonstrated as surely and reproducibly as the motion of balls rolling down inclined planes.
(ETA: When the neuroscience fails to discover this huge complex thing that never carved reality at the joints in the first place, people respond by saying it doesn’t exist, that it went the way of kobolds rather than rainbows.)
Maybe you’re also thinking of this robot’s program as a plain stimulus-response system, as in the behaviourist view of living systems. But what makes it a control system is the environment it is embedded in, an environment in which shooting at blue objects destroys them.
If I replace “program” by “behaviourism”, then I would say that it is behaviourism that is explained away by PCT.
Now I’m very confused. I understand that you think humans are PCT systems and that you have some justifications for that. But unlike humans, we know exactly what motivates this robot (the program in the first paragraph) and it doesn’t contain a controlled variable corresponding to the number of blue objects, or anything else that sounds PCT.
So are you saying that any program can be modeled by PCT better than by looking at the program itself, or that although this particular robot isn’t PCT, a hypothetical robot that was more reflective of real human behavior would be?
As for goals, if I understand your definition correctly, even a behaviorist system could be said to have goals (if you reinforce it every time it pulls the lever, then its new goal will be to pull a lever). If that’s your definition, I agree that this robot has goals, and I would rephrase my thesis as being that those goals are not context-independent and reflective.
I am saying that this particular robot (without the add-on human module) is a control system. It consists of nothing more than that single control system. It contains no representation of any part of itself. It does not reflect on its nature, or try to find other ways of achieving its goal.
The hierarchical arrangement of control systems that HPCT (Hierarchical PCT) ascribes to humans and other living organisms, is more complex. Humans have goals that are instrumental towards other goals, and which are discarded as soon as they become ineffective for those higher-level goals.
Behaviourism is a whole other can of worms. It models living organisms as stimulus-response systems, in which outputs are determined by perceptions. PCT is the opposite: perceptions are determined by outputs.
I agree with you that behaviorism and PCT are different, which is why I don’t understand why you’re interpreting the robot as PCT and not behaviorist. From the program, it seems pretty clearly stimulus-response (STIMULUS: see blue → RESPONSE: fire laser) to me.
Do you have GChat or any kind of instant messenger? I feel like real-time discussion might be helpful here, because I’m still not getting it.
Well, your robot example was an intuition pump constructed so as to be as close as possible to stimulus-response nature. If you consider something only slightly more complicated the distinction may become clearer: a room thermostat. Physically ripped out of its context, you can see it as a stimulus-response device. Temperature at sensor falls below threshold --> close a switch; temperature at sensor rises above threshold --> open the switch. You can set the temperature of the sensor to anything you like, and observe the resulting behaviour of the switch. Pure S-R.
In context, though, the thermostat has the effect of keeping the room temperature constant. You can no longer set the temperature of the sensor to anything you like. Put a candle near it, and the temperature of the rest of the room will fall while the sensor remains at a constant temperature. Use a strong enough heat source or cold source, and you will be able to overwhelm the control system’s efforts to maintain a constant temperature, but this fails to tell you anything about how the control system works normally. Do the analogous thing to a living organism and you either kill it or put it under such stress that whatever you observe is unlikely to tell you much about its normal operation—and biology and psychology should be about how organisms work, not how they fail under torture.
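A toy simulation makes the point numerically (the room model and constants here are crude assumptions of mine, purely for illustration): the rule itself is pure stimulus-response, yet embedded in a room with a heater it holds the temperature near the setpoint despite a constant disturbance such as an open window.

```python
def simulate(setpoint=20.0, heat_per_step=1.0, disturbance=-0.5,
             start_temp=15.0, steps=200):
    """Run the S-R thermostat rule inside a crude linear room model."""
    temp = start_temp
    for _ in range(steps):
        heater_on = temp < setpoint          # the entire S-R "program"
        temp += (heat_per_step if heater_on else 0.0) + disturbance
    return temp
```

With the default cooling disturbance the final temperature sits at the 20-degree setpoint; double the disturbance and it still does. Remove the rule (set the setpoint absurdly low so the heater never fires) and the room just drifts cold, which is the sense in which the S-R description misses what the device is *for*.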
Did you know that lab rats are normally starved until they have lost 20% of their free-feeding weight, before using them in behavioural experiments?
Here’s a general block diagram of a control system. The controller is the part above the dotted line and its environment the part below (what would be called the plant in an industrial context). R = reference, P = perception, O = output, D = disturbance (everything in the environment besides O that affects the perception). I have deliberately drawn this to look symmetrical, but the contents of those two boxes make its functioning asymmetrical. P remains close to R, but O and D need have no visible relationship at all.
When you are dealing with a living organism, R is somewhere inside it. You probably cannot measure it even if you know it exists. (E.g. just what and where, physically, is the set point for deep body temperature in a mammal? Not an easy question to answer.) You may or may not know what P is—what the organism is actually sensing. It is important to realise that when you perform an experiment on an animal, you have no way of setting P. All you can do is create a disturbance D that may influence P. D, from a behavioural point of view, is the “stimulus” and O, the creature’s action on its environment, is the “response”. The behaviourist description of the situation is this:
This is simply wrong. The system does not work like that and cannot be understood like that. It may look as if D causes O, but that is like thinking that a candle put in a certain place chills the room, a fact that will seem mysterious and paradoxical when you do not know that the thermostat is present, and will only be explained by discovering the actual mechanism, discarding the second diagram in favour of the first. No amount of data collection will help until one has made that change. This is why correlations are so lamentably low in psychological experiments.
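A numerical sketch of the R/P/O/D loop described above shows why the "D causes O" reading is so tempting (the linear plant and constants here are made-up assumptions of mine): under good control, the perception P stays pinned near the reference R, while the output O comes to mirror the disturbance D, so an observer who can see only D and O finds a tidy "stimulus-response" law that is really an artifact of control.

```python
import math

def run(R=0.0, gain=0.8, steps=1000):
    """Simulate a simple control loop against a slowly varying disturbance."""
    O = 0.0
    trace = []
    for t in range(steps):
        D = math.sin(t / 50.0)      # disturbance: large and slowly varying
        P = O + D                   # perception = output + disturbance
        trace.append((D, O, P))
        O += gain * (R - P)         # controller pushes P toward R
    return trace
```

Running this, the disturbance swings through nearly its full range while P never strays more than a few hundredths from R; since P = O + D stays near zero, O is almost exactly -D, the "response" tracking the "stimulus" with no causal S-R rule anywhere in sight.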
No, I’ve never used any of those systems. I prefer a medium in which I can take my time to work out exactly what I want to say.
Okay, we agree that the simple robot described here is behaviorist and the thermostat is PCT. And I certainly see where you’re coming from with the rats being PCT because hunger only works as a motivator if you’re hungry. But I do have a few questions:
There are some things behaviorism can explain pretty well that I don’t know how to model in PCT. For example, consider heroin addiction. An animal can go its whole life not wanting heroin until it’s exposed to some. Then suddenly heroin becomes extraordinarily motivating and it will preferentially choose shots of heroin to food, water, or almost anything else. What is the PCT explanation of that?
I’m not entirely sure which correlation studies you’re talking about here; most psych studies I read are done in an RCT type design and so use p-values rather than r-values; they can easily end up with p < .001 if they get a large sample and a good hypothesis. Some social psych studies work off of correlations (eg correlation between observer-rated attractiveness and observer-rated competence at a skill); correlations are “lamentably low” in social psychology because high level processes (like opinion formation, social interaction, etc.) have a lot of noise. Are there any PCT studies of these sorts of processes (not simple motor coordination problems) that have any higher correlation than standard models do? Any with even the same level of correlation?
What’s the difference between control theory and stimulus-response in a context? For example, if we use a simplified version of hunger in which the hormone ghrelin is produced in response to hunger and the hormone leptin is produced in response to satiety, we can explain this in two ways: the body is trying to PCT itself to the perfect balance of ghrelin and leptin, or in the context of the stimulus ghrelin the response of eating is rewarded and in the context of the stimulus leptin the response of eating is punished. Are these the same theory, or are there experiments that would distinguish between them? Do you know of any?
Does PCT still need reinforcement learning to explain why animals use some strategies and not others to achieve equilibrium? For example, when a rat in a Skinner box is hungry (ie its satiety variable has deviated in the direction of hunger), and then it presses a lever and gets a food pellet and its satiety variable goes back to its reference range, would PCTists explain that as getting rewarded for pressing the lever and expect it to press the lever again next time it’s hungry?
Rats don’t always choose drugs over everything else
Summary: An experimenter thought drug addiction in rats might be linked to being kept in distressing conditions, made a Rat Park to test the idea, and found that the rats in the enriched Rat Park environment ignored the morphine on offer.
EDIT: apparently the study had methodological issues and hasn’t been replicated, making the results somewhat suspect, as pointed out by Yvain below
I hate to admit I get science knowledge from Reddit, but the past few times this was posted there it was ripped apart by (people who claimed to be) professionals in the field—riddled with methodological errors, inconsistently replicated, et cetera. The fact that even its proponents admit the study was rejected by most journals doesn’t speak well of it.
I think it’s very plausible that situation contributes to addiction; we know that people in terrible situations have higher discount rates than others and so tend to short-term thinking that promotes that kind of behavior, and certainly they have fewer reasons to try to live life as a non-addict. But I think the idea that morphine is no longer interesting and you can’t become addicted when you live a stimulating life is wishful thinking.
Damn. Oh well, noted and edited in to the original comment.
Well, like I said, all I have to go on is stuff people said on Reddit and one failed replication study I was able to find somewhere by a grad student of the guy who did the original research. The original research is certainly interesting and relevant and does speak to the problems with a very reductionist model.
This actually gets to the same problem I’m having looking up stuff on perceptual control theory, which is that I expect a controversial theory to be something where there are lots of passionate arguments on both sides, but on both PCT and Rat Park, when I’ve tried to look them up I get a bunch of passionate people arguing that they’re great, and then a few scoffs from more mainstream people saying “That stuff? Nah.” without explaining themselves. I don’t know whether it’s because of Evil Set-In-Their-Ways Mainstream refusing to acknowledge the new ideas, or whether they’re just so completely missing the point that people think it’s not worth their while to respond. It’s a serious problem and I wish that “skeptics” would start addressing this kind of thing instead of debunking ghosts for the ten zillionth time.
Just a brief note to say that I do intend to get back to this, but I’ve been largely offline since the end of last week, and will be very busy at least until the end of this month on things other than LessWrong. I would like to say a lot more about PCT here than I have in the past (here, here, and in various comments), but these things take me long periods of concentrated effort to write.
BTW, one of the things I’m busy with is PCT itself, and I’ll be in Boulder, Colorado for a PCT-related meeting 28-31 July, and staying on there for a few days. Anyone around there then?
The PCT learning model doesn’t require reinforcement at the control level, as its model of memory is a mapping from reference levels to predicted levels of other variables. I.e., when the rat notices that the lever-pressing is paired with food, a link is made between two perceptual variables: the position of the lever, and the availability of food. This means that the rat can learn that food is available, even when it’s not hungry.
Where reinforcement is relevant to PCT is in the strength of the linkage and in the likelihood of its being recorded. If the rat is hungry, then the linkage is more salient, and more likely to be learned.
Notice though, that again the animal’s internal state is of primary importance, not the stimulus/response. In a sense, you could say that you can teach an animal that a stimulus and response are paired, but this isn’t the same as making the animal behave. If we starved you and made you press a lever for your food, you might do it, or you might tell us to fork off. Yet, we don’t claim that you haven’t learned that pressing the lever leads to food in that case.
(As Richard says, it’s well established that you can torture living creatures until they accede to your demands, but it won’t necessarily tell you much about how the creature normally works.)
In any case, PCT allows for the possibility of learning without “reinforcement” in the behaviorist sense, unless you torture the definition of reinforcement to the point that anything is reinforcement.
Regarding the leptin/ghrelin question, my understanding is that PCT as a psych-physical model primarily addresses those perceptual variables that are modeled by neural analog—i.e., an analog level maintained in a neural delay loop. While Powers makes many references to other sorts of negative feedback loops in various organisms from cats to E. coli, the main thrust of his initial book deals with building up a model of what’s going on, feedback-loopwise, in the nervous system and brain, not the body’s endocrine systems.
To put it another way, PCT doesn’t say that control systems are universal, only that they are ubiquitous, and that the bulk of organisms’ neural systems are assembled from a relatively small number of distinct component types that closely resemble the sort of components that humans use when building machinery.
IOW, we should not expect that PCT’s model of neural control systems would be directly applicable to a hormone level issue. However, we can reason from general principles and say that one difference between a PCT model of the leptin/ghrelin question and a behavioral one is that PCT includes an explicit model of hierarchy and conflict in control networks, so that we can answer questions about what happens if both leptin and ghrelin are present (for example).
If those signals are at the same level of control hierarchy, we can expect conflict to result in oscillation, where the system alternates between trying to satisfy one or the other. Or, if they’re at different levels of hierarchy, then we can expect one to override the other.
But, unlike a behavioral model where the question of precedence between different stimuli and contexts is open to interpretation, PCT makes some testable predictions about what actually constitutes hierarchy, both in terms of expected behavior, and in terms of the physical structure of the underlying control circuitry.
That is, if you could dissect an organism and find the neurons, PCT predicts a certain type of wiring to exist, i.e., that a dominant controller will have wiring to set the reference levels for lower-level controllers, but not vice-versa.
Second, PCT predicts that a dominant perception must be measured at a longer time scale than a dominated one. That is, the lower-level perception must have a higher sampling rate than the higher-level perception. Thus, for example, as a rat becomes hungrier (a longer-term perceptual variable), its likelihood of pressing a lever to receive food in spite of a shock is increased.
AFAICT, behaviorism can “explain” results like these, but does not actually predict them, in the sense that PCT is spelling out implementation-level details that behaviorism leaves to hand-waving. IOW, PCT is considerably more falsifiable than behaviorism, at least in principle. Eventually, PCT’s remaining predictions (i.e., the ones that haven’t already panned out at the anatomical level) will either be proven or disproven, while behaviorism doesn’t really make anatomical predictions about these matters.
To answer question 3, one could perform the experiment of surgically balancing leptin and ghrelin and not feeding or otherwise nourishing the subject. If the subject eventually dies of starvation, I would say the second theory is more likely.
Outstanding comment—particularly the point at the end about the candle cooling the room.
It might be worthwhile to produce a sequence of postings on the control systems perspective—particularly if you could use better-looking block diagrams as illustrations. :)
My interpretation of this interaction (which is fascinating to read, btw, because both of you are eloquently defending a cogent and interesting theory as far as I can tell) is that you’ve indirectly proposed Robot-1 as the initial model of an agent (which is clearly not a full model of a person and fails to capture many features of humans) in the first of a series of articles. I think Richard is objecting to the connections he presumes that you will eventually draw between Robot-1 and actual humans, and you’re getting confused because you’re just trying to talk about the thing you actually said, not the eventual conclusions he expects you to draw from your example.
If he’s expecting you to verbally zig when you’re actually planning to zag and you don’t notice that he’s trying to head you off at a pass you’re not even heading towards, it’s entirely reasonable for you to be confused by what he’s saying. (And if some of the audience also thinks you’re going to zig they’ll also see the theory he’s arguing against, and see that his arguments against “your predicted eventual conclusions” are valid, and upvote his criticism of something you haven’t yet said. And both of you are quite thoughtful and polite and educated so it’s good reading even if there is some confusion mixed into the back and forth.)
The place I think you were ambiguous enough to be misinterpreted was roughly here:
You use the phrase “human level intelligence” and talk about the robot making the same fuzzy inferential leap that outside human observers might make. Also, this is remarkably close to how some humans with very poor impulse control actually seem to function, modulo some different reflexes and a moderately unreasonable belief in their own deliberative agency (a la Blindsight with the “Jubyr fcrpvrf vf ntabfvp ol qrsnhyg” line and so on).
If you had said up front that you’re using this as a toy model which has (for example) too few layers and no feedback from the “meta-observer” module to be an honestly plausible model of “properly functioning cohesively agentive mammals” I think Richard would not have made the mistake that I think he’s making about what you’re about to say. He keeps talking about a robust and vastly more complex model than Robot-1 (that being a multi-layer purposive control system) and talking about how not just hypothetical PCT algorithms but actual humans function, and you haven’t directly answered these concerns by saying clearly “I am not talking about humans yet, I’m just building conceptual vocabulary by showing how something clearly simpler might function to illustrate mechanistic thinking about mental processes”.
It might have helped if you were clear about the possibility that Robot-1 would emit words more like we might expect someone to emit several years after a serious brain lesion had severed some vital connections in their brain, after their verbal reasoning systems had updated on the lack of a functional connection between their conscious/verbal brain parts and their deeper body control systems. Like Robot-1 seems likely to me to end up saying something like “Watch out, I’m not just having a mental breakdown but I’ve never had any control over my body+brainstem’s actions in the first place! I have no volitional control over my behavior! If you’re wearing blue then take off the shirt or run away before I happen to turn around and see you and my reflex kicks in and my body tries to kill you. Dear god this sucks! Oh how I wish my mental architecture wasn’t so broken...”
For what it’s worth, I think the Robot-1 example is conceptually useful and I’m really looking forward to seeing how the whole sequence plays out :-)
I suspect Richard would say that the robot’s goal is minimizing its perception of blue. That’s the PCT perspective on the behavior of biological systems in such scenarios.
However, I’m not sure this description actually applies to the robot, since the program was specified as “scan and shoot”, not “notice when there’s too much blue and get rid of it”. In observed biological systems, goals are typically expressed as perception-based negative feedback loops implemented in hardware, rather than purely rote programs OR high-level software algorithms. But without more details of the robot’s design, it’s hard to say whether it really meets the PCT criterion for goals.
Of course, from a certain perspective, you could say at a high level that the robot’s behavior is as if it had a goal of minimizing its perception of blue. But as your post points out, this idea is in the mind of the beholder, not in the robot. I would go further as to say that all such labeling of things as goals occurs in the minds of observers, regardless of how complex or simple the biological, mechanical, electronic, or other source of behavior is.
This ‘minimization’ goal would require a brain that is powerful enough to believe that lasers destroy or discolor what they hit.
If this post were read by blue aliens that thrive on laser energy, they’d wonder why we were so confused as to the purpose of an automatic baby feeder.
From the PCT perspective, the goal of an E. coli bacterium swimming away from toxins and towards food is to keep its perceptions within certain ranges; this doesn’t require a brain of any sort at all.
What requires a brain is for an outside observer to ascribe goals to a system. For example, we ascribe a thermostat’s goal to be to keep the temperature in a certain range. This does not require that the thermostat itself be aware of this goal.
> If this post were read by blue aliens that thrive on laser energy, they’d wonder why we were so confused as to the purpose of an automatic baby feeder.
Clever!
Although I find PCT intriguing, all the examples of it I’ve found have been about simple motor tasks. I can take a guess at how you might use the Method of Levels to explain larger-level decisions like which candidate to vote for, or whether to take more heroin, but it seems hokey; I haven’t seen any reputable studies conducted at this level (except one, which claimed to have found against it), and the theory seems philosophically opposed to conducting them (they claim that “statistical tests are of no use in the study of living control systems”, which raises a red flag large enough to cover a small city).
I’ve found behaviorism much more useful for modeling the things I want to model, and I’ve read the PCT arguments against behaviorism; they seem ill-founded. For example, they note that animals sometimes auto-learn, and that behaviorist methodological insistence on external stimuli shouldn’t allow that. But once we relax the methodological restrictions, this seems to be a case of surprise serving the same function as negative reinforcement, something so well understood that neuroscientists can even point to the exact neurons in charge of it.
Richard’s PCT-based definition of goal is very different from mine, and although it’s easily applicable to things like controlling eye movements, it doesn’t have the same properties as the philosophical definition of “goal”, the one that’s applicable when you’re reading all the SIAI work about AI goals and goal-directed behavior and such.
By my definition of goal, if the robot’s goal were to minimize its perception of blue, it would shoot the laser exactly once—at its own visual apparatus—then remain immobile until turned off.
Ironically, quite a lot of human beings’ goals would be more easily met in such a way, and yet we still go around shooting our lasers at blue things, metaphorically speaking.
Or, more to the point, systems need not efficiently work towards their goals’ fulfillment.
In any case, your comments just highlight yet again the fact that goals are in the eye of the beholder. The robot is what it is and does what it does, no matter what stories our brains make up to explain it.
(We could then go on to say that our brains have a goal of ascribing goals to things that appear to be operating of their own accord, but this is just doing more of the same thing.)
Can you spell out the philosophical definition? My previous comment, which I posted before reading this, made only a vague guess at the concept you had in mind: “this sort of conscious, reflective, adaptive attempt to achieve what we ‘really’ want”.
I think we agree, especially when you use the word “reflective”. As opposed to, say, a reflex, which is an unconscious, nonreflective effort to achieve something which evolution or our designers decided to “want” for us. When the robot reflects that it should shoot the hologram projector instead of the hologram, yet that reflection fails to motivate it to do so, I start doubting its behaviors are goal-driven, and suspecting they’re reflexive.
Every time you bring up PCT, I have to bring up my reasons for concluding that it’s pseudoscience of the worst sort. (Note that this is an analysis of an experiment that PJ Eby himself picked to support his claims.)
Actually, Yvain brought it up.
Which linking I don’t mind a bit, since you’re effectively linking to my reply as well, which is then followed by your hasty departure from the thread with a claim that you’d answer my other points “later”… with no further comment for just under two years. Guess it’s not “later” yet… ;-)
(Also, anyone who cares to read upthread from that link can see where I agreed with you about Marken’s paper, or how much time it took me to get you to state your “true rejection” before you dropped out of the discussion. AFAICT, you were only having the discussion so you could find ammunition for a conclusion you’d reached long before that point.)
You also seem to have the mistaken notion that I’m an idea partisan, i.e., that because I say an idea has some merit or that it isn’t completely worthless, that this means I’m an official spokesperson for that idea as well, and therefore am an Evil Outsider to be attacked.
Well, I’m not, and you’re being rude. Not only to me, but to everyone in the thread who’s now had to listen to both your petty hit-and-run pa(troll)ing, and to me replying.
So, I’m out of here (the subthread), but I won’t be coming back later to address any missed points, since the burden is still on you to actually address any of the many, MANY questions I asked you in that two-year-old thread, for which you still have yet to offer any reply, AFAICT.
I entered that discussion with a willingness to change my mind, but from the evidence at hand, it seems you did not.
(Note: if you do wish to have an intelligent discussion on the topic, you may reach me via the old thread. I’m pre-committing not to reply to you in this one, where you can indulge your obvious desire to score points off an audience, vs. actually discussing anything.)
Thanks for the poisoned well, but I don’t intend to abuse the last word. I think more highly of you now than I did when we had our prior altercation, but it remains true that I’ve seen zero experimental evidence for PCT in a cognitive context, and that Marken’s paper is an absolute mathematical sham. There may be valid aspects to PCT, but it hasn’t yet justified its use as a cognitive theory, and I feel that it’s important to note this whenever it comes up on Less Wrong.
(Incidentally, the reason I trailed off in that thread is because I’d done something that in retrospect was poor form: I’d written up a full critique of the Marken paper before I asked you whether you thought it constituted experimental evidence, and I was frustrated that you didn’t walk into the trap. If we both agree that the paper is pseudoscience, though, there’s nothing left to add.)
P.S. I don’t doubt that you’ve had success working with people through a PCT framework, but I suspect that it’s a placebo effect: a sufficiently fuzzy framework gives you room to justify your (usually correct) unconscious intuitions about what’s going on, and grants it the gravitas of a deep-sounding theory. (You might do just as well if you were a Freudian.) That’s one reason why I discount anecdotal evidence of that form.
I recall that a big problem we had before was trying to unpack what different people meant by the words “goal”, “model”, etc. But your description of at least this distinction you’re drawing between the things which you’re calling “goals” and the things which you’re calling “programs” is very good, IMO!
This robot is not a consequentialist—it doesn’t have a model of the world which allows it to extrapolate (models of) outcomes that follow causally from its choices. It doesn’t seem to steer the universe any particular place, across changes of context, because it explicitly doesn’t contain a future-steering engine.
Heh, it’s pretty much exactly what I said.
What exactly is meant by the robot having a human-level intelligence? Does it have two non-interacting programs: shoot blue and think?
This seems to be the key point. Everything interesting about the whole project of human rationality is contained in the interaction between the parts of us that think and the parts of us that do. All of the theories Yvain is criticising are, ultimately, about explaining and modeling the relationship between these two entities.
Ah, excellent. This post comes at a great time. A few weeks ago, I talked with someone who remarked that although decision theory speaks in terms of preferences and information being separate, trying to apply that into humans is fitting the data to the theory. He was of the opinion that humans don’t really have preferences in the decision theoretic sense of the word. Pondering that claim, I came to the conclusion that he’s right, and have started to increasingly suspect that CEV-like plans to figure out the “ultimate” preferences of people are somewhat misguided. Our preferences are probably hopelessly path-, situation- and information-dependent. Which is not to say that CEV would be entirely pointless—even if the vast majority of our “preferences” would never converge, there might be some that did. And of course, CEV would still be worth trying, just to make sure I’m not horribly mistaken on this.
The ease at which I accepted the claim “humans don’t have preferences” makes me suspect that I’ve myself had a subconscious intuition to that effect for a long time, which was probably partially responsible for an unresolved disagreement between me and Vladimir Nesov earlier.
I’ll be curious to hear what you have to say.
This is off-topic but since you mentioned it and since I don’t think it warrants a new post, here are my latest thoughts on CEV (a convergence of some of my recent comments originally posted as a response to a post by Michael Anissimov):
Consider the difference between a hunter-gatherer, who cares about his hunting success and to become the new clan chief, and a member of lesswrong who wants to determine if a “sufficiently large randomized Conway board could turn out to converge to a barren ‘all off’ state.”
The utility of success in hunting down animals and in proving abstract conjectures about cellular automata is largely determined by factors such as your education, culture and environmental circumstances. The same hunter-gatherer who cared to kill a lot of animals, to get the best ladies in his clan, might under different circumstances have turned out to be a vegetarian mathematician solely caring about his understanding of the nature of reality. Both sets of values are to some extent mutually exclusive or at least disjoint. Yet both sets of values are what the person wants, given the circumstances. Change the circumstances dramatically and you change the person’s values.
You might conclude that what the hunter-gatherer really wants is to solve abstract mathematical problems, he just doesn’t know it. But there is no set of values that a person “really” wants. Humans are largely defined by the circumstances they reside in. If you already knew a movie, you wouldn’t watch it. To be able to get your meat from the supermarket changes the value of hunting.
If “we knew more, thought faster, were more the people we wished we were, and had grown up closer together”, then we would stop desiring what we had learnt, wish to think even faster, become yet different people, and get bored of, and rise up from, the people similar to us.
A singleton will inevitably change everything by causing a feedback loop between the singleton and human values. The singleton won’t extrapolate human volition but implement an artificial set of values as a result of abstract high-order contemplations about rational conduct. Many of our values and goals, much of what we want, are culturally induced or the result of our ignorance. Reduce our ignorance and you change our values. One trivial example is our intellectual curiosity. If we don’t need to figure out what we want on our own, our curiosity is impaired.
Knowledge changes and introduces terminal goals. The toolkit called ‘rationality’, the rules and heuristics developed to help us achieve our terminal goals, also alters and deletes them. A stone-age hunter-gatherer seems to possess very different values than I do. If he learns about rationality and metaethics, his values will be altered considerably. Rationality was meant to help him achieve his goals, e.g. become a better hunter. Rationality was designed to tell him what he ought to do (instrumental goals) to achieve what he wants to do (terminal goals). Yet what actually happens is that he is told that he will learn what he ought to want. If an agent becomes more knowledgeable and smarter, this does not leave its goal-reward system intact unless it is especially designed to be stable. An agent who originally wanted to become a better hunter and feed his tribe would end up wanting to eliminate poverty in Obscureistan. The question is, how much of this new “wanting” is the result of using rationality to achieve terminal goals, and how much is a side-effect of using rationality? How much is left of the original values versus the values induced by a feedback loop between the toolkit and its user?
Take, for example, an agent facing the Prisoner’s Dilemma. Such an agent might originally tend to cooperate and only after learning about game theory decide to defect and gain a greater payoff. Was it rational for the agent to learn about game theory, in the sense that it helped the agent to achieve its goal, or in the sense that it deleted one of its goals in exchange for a more “valuable” goal?
It seems to me that becoming more knowledgeable and smarter is gradually altering our utility functions. But what is it that we are approaching if the extrapolation of our volition becomes a purpose in and of itself? A living treaty will distort or alter what we really value by installing a new cognitive toolkit designed to achieve an equilibrium between us and other agents with the same toolkit.
Would a singleton be a tool that we can use to get what we want or would the tool use us to do what it does, would we be modeled or would it create models, would we be extrapolating our volition or rather follow our extrapolations?
Is becoming the best hunter really one of the primitive man’s terminal values? I would say his terminal values are more things like “achieving a feeling of happiness, contentment, and pride in one’s self and one’s relatives”. The other things you mention are just effective instrumental goals.
I mostly agree with this.
I think that the idea of desires converging if “we knew more, thought faster, were more the people we wished we were, and had grown up closer together” relies on assumptions of relatively little self-modification. Once we get uploads and the capability for drastic self-modification, all kinds of people and subcultures will want to use it. Given the chance and enough time, we might out-speciate the beetle (to borrow Anders Sandberg’s phrase), filling pretty much every corner of posthuman mindspace. There’ll be minds so strange that we won’t even recognize them as humans, and we’ll hardly have convergent preferences with them.
Of course, that’s assuming that no AI or mind with a first-mover advantage simply takes over and outcompetes everyone else. Evolutionary pressures might prune the initial diversity a lot, too—if you’re so alien that you can’t even communicate with ordinary humans, you may have difficulties paying the rent for your server farm.
At the end of this, I’m going to try to argue that something like CEV is still justified. Before I started thinking it through I was hoping that taking an eliminativist view of preferences to its conclusion would help tie up the loopholes in CEV, and so far it hasn’t done that for me, but it hasn’t made it any harder either.
CEV has worse problems than worries about convergence. The big one is that it’s such a difficult thing to implement that any AI capable of doing so has already crossed the threshold of extremely dangerous transhuman capability, and there’s no real solution to how to regulate its behavior while it’s in the process of working on the extrapolation. It could very well turn the planet into computronium before it gets a satisfactory implementation, by which point it doesn’t much matter what result it arrives at.
Presumably it matters if it then turns the planet back?
Even if you’re the type who thinks a Star Trek transporter is a transportation device rather than a murder+clone system, there’s no reason to think the AI would have detailed enough records to re-create everyone. Collecting that level of information would be even harder than getting enough to extrapolate CEV.
So I suppose it might matter to the humanity it re-creates, assuming it bothers. But we’d all still be dead, which is a decidedly suboptimal result.
Well, a neverending utopia fit to the exact specifications of humanity’s CEV is still pretty darn good, all things considered.
Related: I recommend to those who think that CEV is insufficiently meta that they read CFAI, and try to go increasingly meta from there instead. Expanding themes from CFAI to make them more timeless is also recommended; CFAI is inherently more timeless than CEV—that’s semi-personal jargon but perhaps the gist is sufficiently hinted at. Note that unlike metaness, timelessness is often just a difference of perspective or emphasis. I assert that CEV is a bastardized popularization of the more interesting themes originally presented in CFAI, and should not be taken very seriously. CFAI shouldn’t either—most of it is useless—but it at least highlights some good intuitions. Edit: I do not mean to recommend proposing solutions or proposing not-solutions, I recommend the meta-level strategy of understanding and developing intuitions and perspectives.
Agreed. But what kind of mistake was that?
Is “This robot is a blue-minimizer” a false statement? I think not. I would classify it as more like the unfortunate selection of the wrong Kuhnian paradigm for explaining the robot’s behavior. A pragmatic mistake. A mistake which does not bode well for discovering the truth, but not a mistake which involves starting from objectively false beliefs.
Why does the human-level intelligence component of the robot care about blue? It seems to me that it is mistaken in doing so. If my motor cortex was replaced by this robot’s program, I would not conclude that I had suddenly started to only care about blue, I would conclude that I had lost control of my motor cortex. I don’t see how it makes any difference that the robot always had it actions controlled by the blue-minimizing program. If I were the robot then, upon being informed about my design, I would conclude that I did not really care about blue. My human-level intelligence is the part that is me and therefore contains my preferences, not my motor cortex.
I predict this would not happen the way you anticipate, at least for some ways to cash out ‘taking control of your motor cortex’. For example, when a neurosurgeon uses a probe to stimulate a part of the motor cortex responsible for moving the arm, and eir patient’s arm moves, and the neurosurgeon asks the patient why ey moved eir arm, the patient often replies something like “I had an itch”, “it was uncomfortable in that position”, or “What, I’m not allowed to move my arm now without getting grilled on it?”
Or in certain forms of motor cortex damage in which patients can’t move their arm, they explain it by saying “I could move my arm right now, I just don’t feel like it” or “That’s not even my real arm, how could you expect me to move that?”.
Although I won’t get there in a while, part of my thesis for this sequence is that we infer our opinions from our behaviors, although it’s probably more accurate to say that our behaviors feed back to the same processes that generate our opinions and can alter them. If this is true, then there are probably very subtle ways of taking control of your motor cortex that would leave your speech centers making justifications for whatever you did.
I’d be very surprised if this worked on me for more than, say, a day. Even if the intuition that I’m the one in control doesn’t go away, I expect to eventually notice that it’s actually false and consciously choose to not take it into account, at least in verbal reasoning. Has it been tried (on someone more qualified than a random patient)? If it doesn’t work, the effect should be seen as rather more horrible than just overriding one’s limb movement.
Interesting. It’s been two days since your cortex has been swapped out, and you seem not to have noticed. Affect slight annoyance at what you will believe to be my weak attempt at humor. This will enable you to produce a consistent, rational narrative which will satisfy your pattern-recognition wetware. That will use less glucose, making you more energy-efficient. People who care about the environment care about energy efficiency, you know. You do care about the environment, don’t you? Of course you do. So, let’s get to it now...
I’ve got pretty bad akrasia. I want to do things, but then I do other things. Intuitively, I feel like I’m me, and I’m in control.
Rationally, not so much. Rationally speaking, the answer to “Am I in control?” depends a lot on how broadly you define “I”. Is my rational mind in control? No way. Is my brain as a whole in control? Yeah, mostly.
Do excuses automatically pop up when I avoid work? Definitely. “I wanted to relax.” “I got distracted.” “I hate working.” Having some rationality allows me to see through them though, which I presume puts me in the “someone more qualified than a random patient” category you mention.
I’m not sure if this is exactly what Yvain is referring to, I just want to shine a light on the matter from a different angle.
I find this curious. When a physician taps my knee with a knee tapping hammer I don’t think to myself “I chose to jerk my leg.” I experience it as something out of my control.
Perhaps endoself was mistaken in placing the robot’s programming in the motor complex, but I believe the point was that in human experience there are two kinds of reactions: those we have at least some form of conscious control over and those we have no conscious control over; and the robot’s blue-minimizing programming would fall into the latter. Thus the robot would not experience the blue minimization as anything other than a strange reflex triggered by the color blue.
Without overfitting, the robot has the goal of shooting at what it sees as blue. It achieves its goal. What I get from the article is that the human intelligence misinterprets the goal. Here I take the definition of a goal to equal what the program is written to do, hence it seems inevitable that the robot will achieve its goal. (If there is a bug in the code that misses shooting a blue object every 10 days, then this should be considered part of the goal as well, since we are forced to define the goal in hindsight, if we have to define one.)
Do you reason similarly for humans?
It is almost proverbial that intentions are better revealed by deeds than by words.
You’re right. Man, I can’t believe I’ve been wasting my time persuading people to sign up for cryonics, because if they wanted to live, they would have already done so! I can’t believe I didn’t realize this before.
I can’t think of another way to reason—does our brain dictate our goal, or receive a goal from somewhere and make an effort to execute it accurately? I’d go with the first option, which to me means that whatever our brain (code) is built to do is our goal.
The complication in the case of humans might be the fact that we have more than one competing goal. It is as if this robot has a multi-tasking operating system, with one process trying to kill blue objects and another trying to build a pyramid out of plastic bottles. Normally they can co-exist somehow with some switching between processes or by just one process “not caring” about doing some activity at the current instance.
It gets ugly when the robot finds a few blue bottles. Then the robot becomes “irrational”, with one process destroying what the other is trying to build. This is simply when you are on a healthy diet and see a slice of chocolate cake—your processes are doing their jobs, but they are competing for resources—who gets to move your arms?
Let’s then imagine that we have in our brains a controlling (operating) system that can get to decide which process to kill when they are in conflict. Will this operating system have a right and wrong decision? Or will whatever it does be the right thing according to its code—or else it wouldn’t have done it?
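The multi-tasking picture above can be made concrete with a toy sketch. Everything here is hypothetical (the process names and the priority scheme are made up); the point is just that with one set of actuators, some arbitration rule has to decide which process wins a conflict, and the “decision” is nothing more than whatever that rule’s code does:

```python
# Two toy "processes" requesting actions on objects in a shared scene.

def blue_minimizer(scene):
    """Wants blue objects destroyed."""
    return [("destroy", obj) for obj in scene if obj.startswith("blue")]

def pyramid_builder(scene):
    """Wants bottles stacked, blue or not."""
    return [("stack", obj) for obj in scene if obj.endswith("bottle")]

def arbitrate(scene, priority):
    """A crude 'operating system': only one action per object can win.
    Processes are applied in ascending priority order, so the request
    from the highest-numbered (last-applied) process overwrites."""
    actions = {}
    for proc in sorted(priority, key=priority.get):
        for act, obj in proc(scene):
            actions[obj] = act
    return actions

scene = ["blue bottle", "red bottle", "blue cup"]
# No conflict over "red bottle" or "blue cup"; the "blue bottle" is
# claimed by both processes, and the arbitration rule settles it.
result = arbitrate(scene, {blue_minimizer: 0, pyramid_builder: 1})
print(result)
```

Notice that asking whether `arbitrate` made the “right” decision about the blue bottle has no answer inside the system: it did what its code does, and “right” only appears when an observer ascribes a goal to the whole.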
I was thinking of, y’know, biased people. Also known as me and everyone I’ve ever met. Telling them “don’t worry, whatever you’re already doing is what you really want” does not seem Friendly.
Does that “human level intelligence module” have any ability to actually control the robot’s actions, or just to passively observe and then ask “why did I do that?” What’re the rules of the game, as such, here?
I don’t think it’s saying anything too shocking to admit this is all a metaphor for people; I’m going to be pushing the view that people’s thoughts and words are a byproduct of the processes that determine behavior rather than directly controlling them. I anticipate providing at least a partial answer to your question in about two weeks; if that doesn’t satisfy you, let me know and we can talk about it then.
One that presents consciousness as an epiphenomenon. In the version of the robot that has human intelligence, you describe it as bolted on, experiencing the robot’s actions but having no causal influence on them, an impotent spectator.
Are your projected postings going to justify this hypothesis?
I hope so. Let’s see.
My first thought was that this was pointing towards an epiphenomenal view of consciousness. But I think it’s actually something more radical and more testable. Yvain, check me if I get this wrong, but I think you’re saying that “our conscious verbal acts—both internally and externally directed—do not primarily cause our actions.”
Here is an experiment to test this: have people perform some verbal act repeatedly, and see if it shifts their actions. This happens to be a well known motivational and behavior-alteration technique, beloved of football teams, political campaigns, governments, and religions, among other organizations. My impression is that it works to a point, but not consistently. Has anybody done a test of how catechisms, chants, and the like shape behavior?
I hope that he explicitly deals with this. By the way, I didn’t know the actual definition of epiphenomenon, which is “a secondary phenomenon that occurs alongside or in parallel to a primary phenomenon”.
But then again...
See also.
I take this to be an elliptical way of suggesting that Yvain is offering a false dichotomy in suggesting a choice between the notion of thoughts being in control of the processes determining behavior and the notion of thoughts being a byproduct of those processes.
I agree. Thoughts are at one with (are a subset of) the processes that determine behavior.
I’m not so sure. Using the analogy of a computer program, we could think of thoughts either as like the lines of code in the program (in which case they’re at one with, or in control of, the processes generating behavior, depending on how you want to look at it) or you could think of thoughts as like the status messages that print “Reticulating splines” or “50% complete” to the screen, in which case they’re byproducts of those processes (very specific, unnatural byproducts, to boot).
My view is closer to the latter; they’re a way of allowing the brain to make inferences about its own behavior and to communicate those inferences. Opaque processes decide to go to Subway tonight because they’ve heard it’s low calorie, then they produce the verbal sentence “I should go to Subway tonight because it’s low calorie”, and then when your friend asks you why you went to Subway, you say “Because it’s low calorie”.
The tendency of thoughts to appear in a conversational phrasing (“I think I’ll go to Subway tonight”) rather than something like “Dear Broca’s Area—Please be informed that we are going to Subway tonight, and adjust your verbal behavior accordingly—yours sincerely, the prefrontal cortex” is a byproduct of their use in conversation, not their internal function.
Right now I’m just asserting that this is a possibility and that it’s distinct from thoughts being part of the decision-making structure. I’ll try to give some evidence for it later.
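The two readings can be put side by side in a toy program (entirely hypothetical; the option names and “habit” weights are invented for the example). In the first function the verbal rule is the decision procedure; in the second, an opaque process decides on grounds the verbal system never sees, and the sentence is generated afterwards, like printing “50% complete”:

```python
def decide_as_code(options):
    """Thoughts as lines of code: the stated rule *is* the procedure."""
    choice = min(options, key=lambda o: options[o]["calories"])
    return choice, f"I'll go to {choice} because it's low calorie"

def decide_then_report(options):
    """Thoughts as status messages: an opaque process decides on habit
    strength, then emits a plausible report. Deleting the report line
    would not change the choice."""
    choice = max(options, key=lambda o: options[o]["habit"])
    report = f"I'll go to {choice} because it's low calorie"
    return choice, report

options = {
    "Subway":       {"calories": 400, "habit": 0.2},
    "Burger place": {"calories": 900, "habit": 0.9},
}
print(decide_as_code(options))      # picks Subway; the reason is accurate
print(decide_then_report(options))  # picks Burger place; the report confabulates
```

The second function also illustrates the confabulation cases from the motor-cortex discussion above: the verbal output is sincere, fluent, and causally downstream of a decision it had no part in.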
If you make the old mistake of confusing thoughts in general with analytic, reflective, verbal, serial internal monologue, I’m going to be sad.
I find this rather alien. Some processes are opaque, but that kind definitely isn’t. Something (hunger, time, memory of previously made plans, whatever) triggers a reusable pick-a-sandwich-shop process; names and logos of nearby stores come up; associated emotions and concepts come up; weights associated to each shift—an image of those annoying health freaks who diet all the time upvotes “tasty” and downvotes “low calorie”; eventually they stabilize, create an image of myself going to Subway rather than somewhere else, and hand it over to motor control. If something gets stuck at any point, the process stops, a little alarm rings, and internal monologue turns to it to make it come unstuck. If not, there are no verbal thoughts at any point.
Probably time to start being sad; I’m mostly going to use “thoughts” that way. But I think what I’m talking about holds for any definition of “thought” where it’s a mental activity accessible to the conscious mind.
I recognize different people use internal monologue to a different degree than others, but whether you decide with a monologue or with vague images of concepts, I think the core idea that these are attempts to turn subjective processes into objects for thought, usually so that you can weave a social narrative around them, remains true.
You may have missed a subtlety in my comment. In your grandparent, you said “people’s thoughts and words are a byproduct …”. In my comment, I suggested “Thoughts are at one with …”. I didn’t mention words.
If we are going to focus on words rather than thoughts, then I am more willing to accept your model. Spoken words are indeed behaviors—behaviors that purport to be accurate reports of thoughts, but probably are not.
Perhaps we should taboo “thought”, since we may not be intending the word to designate the same phenomenon.
What I meant was that if the intelligence part is utterly passive with respect to the behavior, then I’m unsure how strong a metaphor it is for human behavior. Yes, we sometimes don’t know why we do things, or we have inaccurate models of ourselves as far as our explanations of why we do what we do. But the “intelligence part” having absolutely no effect on the actions?
But, perhaps all will be resolved in two weeks. :)
Robin Hanson would also argue that the robot wants to minimize things that are ‘far’, since blue is ‘far’.
One of the obvious extensions of this thought experiment is to posit a laser-powered blue goo that absorbs laser energy, and uses it to grow larger.
This thought experiment also reminds me: Omohundro’s arguments regarding likely uFAI behavior are based on the AI having goals of some sort—that is, something we would recognize as goals. It’s entirely possible that we wouldn’t perceive it as having goals at all, merely behavior.
Please share your thoughts more often, that wasn’t obvious to me at all.
I not infrequently say things I think are obvious and get surprised by extreme positive or negative reactions to them.
If it isn’t goal-directed, it isn’t intelligent. Intelligence is goal-directed—by most definitions.
To the extent that intelligence solves problems, then yes, problem-solving modules of an intelligent entity have at least short-term goals. But whether the entity itself has goals is a different question.
I can imagine a sufficiently dangerous uFAI having no consistent goals we would recognise, and all of its problem-solving behaviour, however powerful and adaptable, being at a level we wouldn’t call intelligence (e.g. the uFAI could include a strategic module like a super-powerful but clearly non-sentient chess computer, which would not in itself have any awareness or intent, just problem-solving behaviour.)
Actually, I’m not going to disagree with you about definitions of intelligence. But I suspect most of them are place-holders until we understand enough to dissolve the question “what is intelligence?” a bit.
Be suspicious of arguments from definition. Why must intelligence be goal-directed? Why is this an integral part of the definition of intelligence, if it must be?
Is there a sense in which we can conclude that the robot is a blue-minimizing robot, in which we can’t also conclude that it’s an object-minimizing robot that happens to be optimized for situations where most of the objects are blue or most of the backgrounds are non-blue? (Perhaps it’s one of a set of robots, or perhaps the ability to change its filter is an intentional feature.)
A couple of points here. First, as other people seem to have indicated, there does seem to be a problem with saying ‘the robot has human level intelligence/self-reflective insight’ and simultaneously that it unreflectively carries out its programming with regard to firing lasers at percepts which appear blue, in so far as it would seem that the former would entail that the latter would /not/ be done unreflectively. What you have here are two separate and largely unintegrated cognitive systems: on the one hand, a module which ascribes functional-intentional properties to things (including the robot) and has human level intelligence; on the other, the robot itself.
The second point is that there may be a confusion based upon what your functional ascriptions to the robot are tracking. I want to say that objects have functions only relative to a system in which they play a role, which means that, for example, the robot might have ‘the function’ of eliminating blue objects within the wider system which is the Department of Homeland Security; however, there is no discoverable fact about the robot which describes its ‘function simpliciter’. You can observe what appears to be goal-directed behaviour, of course, but your ascriptions of goals to the robot are only good in so far as they serve to predict the robot’s future behaviour (this is a standard descriptivist/projectivist approach to mental content ascription, of the sort Dennett describes in ‘The Intentional Stance’). So when you insert the inverting lens in front of the robot’s camera, it ceases to exercise the same goal-directed behaviour, or (what amounts to the same, expressed differently) your previous goal ascriptions to the object cease to be able to make reliable predictions of the robot’s future behaviour and need to be corrected. ((I’m going to ignore the issue of the ontological status of these ascriptions. If this is an interest you happen to have, Dennett discusses his views on the subject in an essay entitled ‘Real Patterns’, and there is further commentary in a couple of the articles in Ross and Brook’s ‘Dennett’s Philosophy’.))
I realise you are consciously using a naive version of behaviourism as the backdrop of your discussion, so it’s possible that I’m just jumping ahead to ‘where you’re going with this’, but it does seem that with subsequent post-behaviourist approaches to mental content ascription the puzzle you seem to be describing of how to correctly describe the robot dissolves. ((N.B. - You might want to look at Millar’s Understanding People, which surveys a broad range of the various approaches to mental state ascription.))
Shouldn’t the human intelligence part be considered part of the source code? With its own goals/value functions? Otherwise it will be just a human watching the robot kind of thing.
I wonder how many people upvoted this post less for the ideas expressed and more because they like robots.
I think I upvoted it for the ideas, but can’t honestly guarantee that the “oooh, shiny cool robot analogy appeals to my geeky heart” factor wouldn’t have made me upvote even if I found the ideas uninteresting.
I wonder how many people upvoted this post less for the ideas expressed and more because they like Yvain.
I like how Yvain creates clarity. This takes a lot of effort. I’d like to encourage his effort.
I upvote articles and comments that promise future articles that I want to read.
ditto :) I never have enough reading material.
I up-voted because:
Writing-style gives me a strong internal impression of clarity / comprehension. I enjoy this sensation and think it correlates with understanding, and so am trying to promote more of it.
Blue feels very soothing and I think contributes to the sensation of clarity / comprehension. I up-vote soothing sensation.
I like the possible direction this could take in terms of microeconomic utility, revealed preferences, and so on for understanding human intelligence. So my up-vote is payment for expected future ideas.
I haven’t thought much about why it feels like we have goals (desires), so I look forward to that! I do think it’s quite possible that eliminativism about ‘beliefs’ and ‘desires’ will turn out to be the best way to go. Certainly, the language of ‘reinforcers’ fits well with our understanding of the reward-learning system in the brain.
Chapter 2 of Gary Drescher’s Good and Real attempts to address why it feels like we have goals. If the robot could learn to destroy hologram projectors and adapt to color-inverting lenses, the assertion “the robot wants to minimize blue objects” would have some explanatory virtue.
Yes, ‘folk psychology’ (beliefs+desires=intentional action) is a compelling theory because it works so successfully in everyday social interactions, but I’m wondering if Yvain has something more planned for why it feels (inside) like we have goals. My guess is that it’s because we use the same folk theory to explain our own behavior as to explain others’ behavior, but perhaps he has something else in mind.
Nope, pretty much that. I’ll be presenting a few studies to justify it, but I’m sure you’ve seen them before.
Ah. Well. They are fun studies. :)
It’s not obvious that folk psychology works well at all except regarding motor actions.
I consider all of the behaviors you describe as basically transform functions. In fact, I consider any decision maker a type of transform function where you have input data that is run through a transform function (such as a behavior-executor, utility-maximizer, weighted goal system, a human mind, etc.) and output data is generated (and in the case of humans sent to our muscles, organs, etc.). The reason I mention this is that trying to describe a human’s transform function (i.e., what people normally call their mind) as mostly a behavior-executor or just a utility-maximizer leads to problems. A human’s transform function is enormously complex and includes both behavior execution aspects and utility maximization aspects. I also find that attempts to describe a human’s transform function as ‘basically a __’ results in a subsequent failure to look at the actual transform function when trying to figure out how people will behave.
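The “transform function” framing above can be sketched in a few lines of code. This is purely illustrative: the function names, the percept format, and the threshold are all invented for the example, not taken from the post.

```python
# Illustrative sketch only: modeling any decision maker as a transform
# function from input data to output actions. Names, percept format,
# and numbers here are invented for the example.

def behavior_executor(percept: dict) -> str:
    # Pure stimulus-response: the action is triggered directly by the input.
    return "fire_laser" if percept.get("blue", 0) > 200 else "move_forward"

def utility_maximizer(percept: dict) -> str:
    # Scores candidate actions and picks the highest-scoring one; still
    # just a transform from input to output, with a different internal shape.
    scores = {"fire_laser": percept.get("blue", 0), "move_forward": 50}
    return max(scores, key=scores.get)

print(behavior_executor({"blue": 255}))  # fire_laser
print(utility_maximizer({"blue": 10}))   # move_forward
```

Both are transform functions in the comment’s sense; the point being made is that a human’s transform function mixes both shapes and is vastly more complex than either caricature.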
Is “transform function” a technical term from some discipline I’m unfamiliar with? I interpret your use of that phrase as “operation on some input that results in corresponding output.” I’m having trouble finding meaning in your post that isn’t redefinition.
Well-written and insightful. This post reminds me very much of the free will sequence, except that the reduction is on the level of genes and the adaptations they code for rather than physical laws. I look forward to seeing the rest of the sequence, and I’m interested to see how you will dissolve the feeling that our behavior and actions are goal-oriented.
The entire example is deeply misleading. We model the robot as a fairly stupid blue-minimizer because this seems to be a good succinct description of the robot’s entire externally observable behavior, and it would cease to be so if the robot also had a speaker or display window with which it communicated its internal reflections.
So to retain the intuitive appeal of describing the robot as a blue-minimizer, the robot’s human-level intelligence must be walled off inside the robot, unable to effectively signal to the outside world. But so long as the human-level intelligence is irrelevant to predicting the robot’s exterior behavior, the blue-minimizing model is an appropriate one to keep in mind to guide our interactions with the robot. That is, like any good scientific model, it provides good predictive power relative to its cost in mental (or simulating computer’s) effort and memory.
It’s pretty obvious why it’s useful to us to describe stuff in ways that let us feasibly predict/approximate the behavior of external entities/effects we encounter. Perhaps, though, you are puzzled by, or arguing against, the idea that belief-desire style models are often a good tradeoff between accuracy and ease of use. However, this is also easily explained as a result of evolutionary hard-wiring that effectively functions as a hardware accelerator for belief-desire models (so we didn’t get eaten), plus the salience to our lives of objects designed by other humans. The acceleration means that even in domains where the fit is poor (like charges want to get away from each other) the ease of application still makes them a useful heuristic. Also, since human-made objects are usually built to achieve a particular goal that had to be represented usefully in the builder’s mind, these objects usually offer the most effective behavior for accomplishing that goal relative to a given level of computational complexity.
In other words, since the guy who builds the robot to lase the blue-dyed cancer cells does so by coming up with a goal he wants the robot to achieve (discriminating between blue cells and other blue things is hard, so we will just build a robot to fry all the blue things it sees) and then offering up the best implementation he can come up with given the constraints, the resulting behavior can be well modeled as the object desiring some end but being stupid in various ways. In other words, if you want to zap blue cells you don’t add extra code to zap yellow one time in a million, nor would you tack on AI without needing its expertise to implement the desired behavior, so the resulting behavior looks like a stupid creature trying to achieve the inventor’s chosen goal.
Interestingly, I suspect that being well described by a belief-desire model probably simply corresponds to being in the set of non-dominated ways to achieve a goal people can reasonably conceptualize. Thus we see it all the time in evolution, as we can easily understand both the species-level goal of survival and individual-level goals of avoiding suffering and satisfying some basic wants, and natural selection ensures that the implementations we usually see are at least locally non-dominated (if you want to make a better hunter on the savannah than the lion, you have to either jump to a whole new basic design or use a bigger computational/energy budget).
EDIT: It just clicked after finishing my thought.
I was thrown off by all the comments about the robot and its behavior. This is more about the comparison of behavior-executor vs. utility-maximizer, not the robot.
Perhaps I am missing the final direction of this conversation, but I think the intelligence involved in the example has mapped the terrain and is failing to update the map once it has been seen to be incorrect.
Correct, the robot is designed to shoot its laser at blue things.
The robot is failing at nothing. It is doing exactly as programmed. The laser, on the other hand, is failing to eliminate the blue object, elimination being what is expected when a laser is shot at something. Now that this experiment has been conducted and the map has been found to be wrong, correct the map. It is no longer a Blue-Minimizing Robot; it is a Blue-Targeting-and-Shooting Robot. The result of the laser shot is variable.
The discussion has mapped it as a Blue-Minimizing Robot; it is not one, as this experiment proves. If it were to be made one, more programming would have to be implemented in case the laser does not have the intended effect. Since there is no way of altering the programming, there is no way of changing the terrain, so the map must be changed.
I don’t see that, why would it “care” if its goal isn’t complex enough to allow it to care about the subversion of its sensors? I mean, the level of intelligence seems irrelevant here. Intelligence isn’t even instrumental to such simple goals because all it “wants” is to fire a laser at blue objects. Its utility function says nothing about maximizing its efficiency or anything like that.
Unfortunately, your question is unanswerable, at least until Yvain invents the rest of the story. We haven’t been told what goals, if any, are embodied in the intelligent part of the code. Strike that “if any” part, though—I think we can infer that it has goals from the specification that it has human-level intelligence. And even infer something about what some of the goals are like (truth-seeking, for example).
We also haven’t been told the relationship between blue-zapping code and intelligence—whether it is physically possible, for example, for the intelligent processes to modify the blue-zapping code modules.
Edit: Psy-Kosh raised similar questions.
Take four common “broad” or “generally-categorizable” demographics of minds: Autistic people; Empaths (lots of mirror neurons dedicated to modeling the behavior of others); Sociopaths, or “Professional Psychopaths” (high-functioning without mirror neurons, responsible for most systemic destruction, precisely because they can appear to be “highly functional and productive well-respected citizens”); and Psychopaths (low-functioning without mirror neurons, most commonly an “obvious problem”). All of these humans’ minds work on the principle of emergent order, with logic, reason, and introspection being the alien or uncommon state of existence, a minor veneer on the surface of the vast majority of brain function, which is: “streaming perception and prediction of emergent patterns.”
A robot that never evolved to “get along” with other sentiences, and is programmed in a certain way, can “go wrong” or have any of billions of irrational “blue minimizing” functions. It seems that sure, it’s a “behavior executor,” not a “utility maximizer.” I would go further and say that humans are not “utility maximizers” either, except when they are training themselves to behave in a robotic fashion toward the purpose of “maximizing a utility” based on the very small number of patterns that they have consciously identified.
There’s no reason for a super-human intelligence (one with far more neocortex, or far more complex neocortex that’s equipped to do far more than model linear patterns, which perhaps automatically “sees” exponentials, cellular automata, and transcendental numbers) to be so limited.
Humans aren’t much good at intelligent planning that takes other minds, and “kinds of minds” into account. That’s why our societies regularly fall into a state of dominion and enslavement, and have to be “started over” from a position of utter chaos and destruction (ie: the rebuilding of Berlin and Dresden).
Far be it from me to be “mind-killed,” but I think avoiding that fate should be a common object of discussion among people who are “rational.” (ie: “What not to do.”)
I also don’t think it’s fair to lump “behaviorists” (other than perhaps B.F. Skinner) into an irrational school of “oversimplification.” Even Skinner noted that his goal was to get to the truth, via observation. (Eliminate biases.) By trying to make an entire school out of the implications of some minds, some of the time, we oversimplify the complex reality.
Behaviorism has caught scores of serial killers. (According to John Douglas, author of Mindhunter, originator of the FBI’s Investigative Support Unit.) How? It turns out that serial killer behavior isn’t that complex, and it’s seeking goals that superior minds actually can model quite accurately. (This is much like a toddler chasing a ball into the street. Every adult can model that as a “bad thing,” because their minds are superior enough to understand 1. what the child’s goal is, 2. what the child’s probable failures in perception are, and 3. what the entire system of child, ball, street, and their inter-related feedbacks is likely to produce, as well as how the adult can, and should, swoop in and prevent the child from reaching the street.)
So, behaviorism does help us do two things: 1. eliminate errors from prior “schools” of philosophy (which were, themselves, not really “schools” but just significant insights); 2. reference “just what we can observe,” in terms of revealed preferences. Revealed preferences are not “the whole picture.” However, they do give us a starting point for isolating important variables.
This can be done with a robot or a human, but the human is a group of “messy emergent networks” (brain regions, combined with body feedback, with nagging long-term goals in the background, acting as an “action-shifting threshold”) whose goals are the result of modeled patterns and instances of reward. The robot, on the other hand, lacks all the messy patterns, and can often deal with reality as a set of extreme reductions, in a way that no (or few) humans can.
The entire “utility function” paradigm appears to be a very backwards way of approximating thought to me. First you start with perceived patterns, then, you evolve ever-more-complex thought.
This allows you to develop goals that are really worth solving.
What we want in a super-intelligence is actually “more effective libertarians.” Sure, we’ve found that free markets (very large free networks of humans) create wealth and prosperity. However, we’ve also found that there are large numbers of sociopaths who don’t care about wealth and prosperity for all, just for themselves. Such a goal structure can maximize prosperity for sociopaths, while destroying all wealth and prosperity for others. In fact, this repeatedly happens throughout history, right up to the present. It’s a cycle that’s been interfered with temporarily, but never broken.
Would any robot, by default, care about shifting that outcome of “sociopaths dominate grossly-imperfect legal institutions”? I doubt it. Moreover, such a sociopath could create a lasting peace by creating a very stable tyranny, replete with highly-functional secret police, and a highly effective algorithm for “how to steal the most from every producer, while sensing their threshold for rebellion.”
In fact, this is what the current system attempts to accomplish: There’s no reason for the system to decay to Hitler’s excesses, when scientists, producers, engineers, etc. have found (enough) happiness (and fear) in slavery. How much is “enough”? It’s “enough (happiness) to keep producing without rebellion,” and “enough (fear) to disincentivize rebellion.”
This is like bailing a few thousand gallons of water while the Titanic is sinking. 1. It won’t make any difference to any important goal, short-term or long-term. 2. It deals with a local situation that is irrelevant to anything important, worldwide. 3. It deals with theories of the mind that are compatible with Francis Crick and Jeff Hawkins’ work, but only useful to narrow sub-disciplines like “How do we know when law enforcement should take action?” or “When we see this at a crime scene, it’s a good threshold-based variable for how many resources we should throw at the problem.” 4. Every “school” that stops referring to reality and nature, to the extent it does so, is horribly flawed (this is Jeff Hawkins, who is right about almost everything, screwing up royally in dismissing science fiction as “not having anything important to say about brain building”). 5. When you’re studying human “schools,” you’re studying a narrow focus of human insight described with words (“labels” and “maps”) instead of the insight they’ve derived from their modeling of the territory. (Korzybski, who himself turned a few insights into a “school.”)
This article got me thinking about a few things. Centrally, there is no necessary condition that things which really do have goal-directed behaviour must maximise anything. Revealed preference theory in economics effectively identifies the choices people make with their preferences. Thus humans are seen as maximisers of their preferences in an almost trivial sense, if preferences are also assumed to fall under a ranking (another thing economists assume). Yet this goal-orientated explanation in terms of rationality could be wrong for two reasons: firstly, the choices might not be goal-orientated—they could simply be behaviour with no goal-based cause—or they could be the result of a different goal, one which may not maximise. The first possibility is discussed in the article. The second is not, but is interesting, as the second condition could be met for many reasons. If I went to see a movie, a believer in revealed preference theory may conclude I preferred seeing that movie over all else that night. But perhaps I went to the wrong movie by mistake. Perhaps I was forced to go to the movie at gunpoint, and preferred to be at home and safe. Perhaps I really preferred to go to see a play, but hadn’t really reflected on my preferences very hard. Or perhaps I went to the movie for a friend who wanted to see it with me, when in fact I would have preferred not to go.
In some of those cases, we might explain the supposed problem away by reflecting that in those particular circumstances, my preference was in fact to go to the movie; but had the circumstances been different I would have preferred otherwise. For instance, at gunpoint, in those circumstances, I wouldn’t prefer to walk home, as I would get shot. However, this counterargument presumes we can’t have preferences over counterfactual situations, which we evidently can have. Also, it isn’t very useful to the economist, who would like to conclude that preferences are stable, so that the choice a person makes at one time will be repeated in the future.
Also, the assumption that the laser is for destruction may be false. What if the laser tags or scans blue objects? Maybe for a space probe, or a scan of enemy robots? What if the robot makes the assumption that the laser is for destruction when it’s not, or vice versa? Does it really matter, when it is programmed to do it anyway? What if the AI is destroying, or thinks it’s destroying, fellow robots? What if it does not want to? Maybe this is an example of religion from assumption? Because it is ‘destroying’ blue things, it assumes that is its purpose. Behavior/motivation from assumption...
This was very interesting, thanks for sharing. My first reply on-site (for a while at least), hope it wasn’t in too much of a different direction from what you meant.
I note that if the robot only looks at the blue RGB component, then it will end up firing its laser not just at blue things (low R, low G, high B), but also white things (high R, high G, high B), fuchsia things (high R, low G, high B), teal things (low R, high G, high B), and variants of said colors. “Blue-minimizing” is not even a correct description!
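The point is easy to check with a quick sketch. The threshold value and the color list below are made up for illustration; only the “threshold the B channel alone” rule comes from the post.

```python
# Sketch of a detector that thresholds only the B channel of an RGB pixel,
# as the post describes. The threshold and colors are arbitrary examples.
BLUE_THRESHOLD = 200  # assumed 0-255 channel scale

def fires_laser(r: int, g: int, b: int) -> bool:
    """True whenever the blue component alone passes the threshold."""
    return b > BLUE_THRESHOLD

for name, rgb in [("blue", (0, 0, 255)), ("white", (255, 255, 255)),
                  ("fuchsia", (255, 0, 255)), ("teal", (0, 255, 255)),
                  ("red", (255, 0, 0))]:
    print(name, fires_laser(*rgb))  # everything except red triggers it
```

Every color with a saturated blue channel triggers the laser, so “high-B-component-minimizing” would be the more honest label.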
Or it only gets disutility from seeing blue objects :P
It seems like we need to taboo the word “goal” and replace it with several different things. 1. Actions. 2. Conscious (verbal) intentions (past intentions may conflict with present intentions). 3. “unconscious intentions” (this should probably be tabooed, but I haven’t figured out how best to do it. Perhaps an unconscious intention is something which we do, but we don’t know why or we confabulate why.).
It seems that Anna Salamon’s and Hanson’s interpretations could simply be viewed as changing verbal intentions, and the question is how to gain goal (intention) stability.
A lot of this is discussed at length in the book The Ecological Approach to Visual Perception by Gibson. He put out a pretty radical idea that all cognition is goal driven, and perception is the action of converting sensory energy into scored models (scores are computed as anticipated experience of goal-fulfillment). Fundamentally, all thinking is planning. There’s quite a bit of literature coming out on this in the fields of online learning theory and actionable information theory, in particular by Soatto at UCLA.
Curious that you worded it: “...we would be tempted to call this robot a blue-minimizer.” Then you said the robot “wants”. The entire discussion which followed is invalid, because everyone involved ascribed human characteristics to an electronic mechanical device. Robots do not have “goals” either, nor self-motivation. The robot has no concept of reducing blue objects. Your “temptation” was your sensed instinct that you were stepping over the line of limitations. Even the origin of the word “robot”, meaning “forced labor”, is inaccurate.
And are you a utility-optimizer, or a behavior-executor?
Yes, yes, yes!
Your post reminded me of a quote by Eliezer Yudkowsky:
Humans are not selfish or altruistic, humans probably don’t even have a utility-function because our goals are not stable. Our goals are rigidly coupled with the environment and circumstances we reside in. Words are too simplistic to describe what we are and what we want.
Even what a human is is not stable.
My skin regenerates...my teeth are mineralized...I have some fillings...I eat some grapes...and spit out the pits...pull on some pants...and go drive a car. How am I qualitatively non-arbitrarily not my car?
The Voyager Horcrux has naked humans, but if hermit crabs were intelligent, would they send the probe up with an image of hermit crabs without shells?
I am not sure.
What is the difference between a smart ‘shoot lasers at “blue” things’ robot and a really dumb ‘minimize blue’ robot with a laser?
A really smart ‘shoot lasers at “blue” things’ robot will shoot at blue things if there are any, and will move in a programmed way if there aren’t. All its actions are triggered by the situation it is in; and if you want to make it smarter by giving it an ability to better distinguish actually-blue from blue-looking things, then any such activity must be triggered as well. If you program it to shoot at projectors that project blue things it won’t become smarter, it will just shoot at some non-blue things. If you paint it blue and put a mirror in front of it it will shoot at itself, and if you program it to not shoot at blue things that look like itself it won’t become smarter, it will just shoot at fewer blue things. If anything it shoots at doesn’t cease to be blue or you give it a blue laser or camera lens, it will just continue shooting because it doesn’t care about blue things or shooting; it just shoots when it sees blue. It certainly won’t create blue things to shoot at.
A really dumb ‘minimize blue’ robot with a laser will shoot at anything blue it sees, but if shooting at something doesn’t make it stop being blue, it will stop shooting at it. If there’s nothing blue around it will search for blue things. If you paint it blue and put a mirror in front of it it will shoot at itself. If you give it a blue camera lens it will shoot at something, stop shooting, shoot at something different, stop shooting, move around, shoot at something, stop shooting, etc, and eventually stop moving and shooting altogether and weep. If instead of the camera lens you give it a blue laser it will become terribly confused.
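The contrast drawn in the two comments above can be caricatured as two step functions. Everything here (the action names, the `last_shot_removed_blue` feedback signal) is invented for the sketch, not part of the original thought experiment.

```python
# Caricature of the two robots described above. Action names and the
# feedback signal are invented for illustration.

def smart_shooter_step(sees_blue: bool) -> str:
    # Behavior-executor: the current percept alone triggers the action;
    # outcomes are never consulted.
    return "shoot" if sees_blue else "move_along_programmed_path"

def dumb_minimizer_step(sees_blue: bool, last_shot_removed_blue: bool) -> str:
    # Goal-checker: compares outcomes against the goal "less blue" and
    # changes behavior when shooting stops working.
    if sees_blue:
        return "shoot" if last_shot_removed_blue else "stop_shooting"
    return "search_for_blue"
```

The shooter never checks whether shooting worked; the minimizer, however dumb, does. That feedback loop is what makes “it wants less blue” a predictive model of the second robot but not the first.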
Why would a dumb ‘minimize blue’ robot search for blue things?
Current condition: I see no blue things.
Expected result of searching for blue things: There are more blue things.
Expected result of not searching for blue things: There are no blue things.
Where ‘blue’ is defined as ‘eliciting a defined response from the camera’. Nothing outside the view of the camera is blue by that definition.
...and a sufficiently non-dumb ‘minimize blue’ robot using that definition would disconnect its own camera.
Right. If we want anthropomorphic behavior, we need to have multiple motivations.
A “dumb robot” is assumed to be a weak optimizer. It isn’t defined to be actively biased and defective in known ways related to the human experience ‘denial’. Sure you can come up with specific ways that an optimizer could be broken and draw conclusions about what that particular defective robot will do. But these don’t lead to a conclusion where it makes sense to rhetorically imply the inconceivability of a robot that doesn’t have that particular bug.
I’m putting ‘dumb’ at roughly the level of cognition of a human infant, lacking object permanence. Human toddler intelligence counts as ‘smart’. I’m considering a strict reward system: negatrons added proportional to the number of blue objects detected.
Then, as per the grandparent, the answer to the rhetorical question “Why would a dumb ‘minimize blue’ robot search for blue things?” is because it doesn’t happen to be a robot designed with the exact same peculiarities and weaknesses of a human infant.
Lack of object permanence isn’t a peculiar weakness. The ability to spontaneously leave Plato’s cave is one of the things that I reserve for ‘smart’ actors as opposed to ‘dumb’ ones.
A smart robot that shoots lasers at blue things will shoot at blue things it models as being there even if it can’t see them.
A smart ‘shoot lasers at “blue” things’ robot may tile the cosmic commons with blue things to be shot (exactly what it tiles the cosmic commons with depends on the details of implementation). A really dumb minimize-blue robot… um… it’ll shoot blue things. Probably. And sometimes miss.
I didn’t mean a smart “maximize blue things shot with lasers” robot. Although I suppose creating blue things to shoot is a reasonable action to take once all the easily accessible blue things have been destroyed.
Oddly enough, a similar behavior has been noted in AA and other rehab support groups; when there are no more easily accessible addicts to cure, someone will relapse. That’s perfectly rational behavior for a group that wants to rehabilitate people, even if it isn’t conscious.
This post brings to mind a fault I see with trying to create a trust system of Friendly AIs. Humans are inherently untrustworthy and random. A robotic AI that is built to be friendly (or at least interact) with humans should by most conceptions have a set of rules to work with that minimize its desire to kill us all and turn us into computer fuel. Even in an imperfect world, AIs would be trained to deal with humans in a forthright and honest fashion, to give the truth when asked, and to build assumptions based on facts and real information. Humans, however, are irrational creatures that lie, cheat, and steal when it is within our own best interest to do so, and we do it on a regular basis. For those of you who disagree with that premise, please look at the litany of laws we are asked to follow on a daily basis, starting with traffic laws.

Imagine a world of AI drivers and place a human in their midst. Then take away all ‘rules’ that force every driver to move in x direction at y speed on a given roadway. The AI drivers would move with purpose, but be programmed to understand how important their speed and direction were based on the purpose of their travel. Those going to work at a leisurely pace would drive slower, and congregate in one or two areas of the road. The AIs that need to run a speedy errand or who are in an emergency would move faster and be programmed to take into account the slower vehicles. But a human in their midst would not care about the others so much as about their own personal issues. They would want to move faster because they like driving fast, or would want to stay to the right because they get nervous in the left lanes. Or perhaps they would drive in the area that gave them the best view of the sunset, and slow down to enjoy it—forcing the AIs behind them to slow down as well.
And when we take the example of an AI who is supposed to work with humans as a receptionist, what does it do when the AI is faced with a human who lies to get past it? If the human lies convincingly and the AI lets him go, how will the AI react when it finds out the human lied? Are all humans bad? If the same human returns and is now part of the company, will the AI no longer ‘trust’ that human’s information? If a human uses the AI to mess with another human (don’t tell me people never use computers to play pranks on each other) how will the AI ‘feel’ about being used in such a manner? As humans, we have a set of emotions and memories that allow us to deal with people who do such things. Perhaps we would have a stern chat with the guy who tried to get past us, or play a prank back on the gal who messed with us last time. But should computers be equipped with such a mechanism? I really do not believe so. It is a slippery slope for a robot to play tricks on a human. Unless they are very advanced (such as body scanners that serve as lie detectors), there is little room for them to do anything but trust us.
Have you read the sequences yet? It seems like you’re anthropomorphizing AI to an unreasonable degree (yes, arguing about how they’re going to be different from us can still be too anthropomorphizing), and the claim that humans are “inherently untrustworthy and random” is a pretty confused statement. Humans are chaotic (difficult to predict without very complete information), but not random (outcomes chosen arbitrarily from among the available options), and as for “inherently untrustworthy,” it’s not really even clear what such a statement would mean. That may sound overly critical or pedantic, but it’s really not obvious, for instance, what if anything you think would qualify as not inherently untrustworthy, and why you think they’re different.