The problem of pseudofriendliness
The Friendly AI problem is complicated enough that it can be divided into a large number of subproblems. Two such subproblems could be:
The problem of goal interpretation – This means that the human expectation of the results of an AI implementing a goal differs from the results that the AI actually works toward.
The problem of innate drives (see Steve Omohundro’s ‘Basic Drives’ paper for more detail) – This is when either specific goals, or goal-based reasoning in general, creates subgoals that humans do not anticipate.
Let’s call an AI which does not suffer from these problems a pseudofriendly AI. Would this be a useful type of AI to produce? Well, maybe or maybe not. But even if it fails to be useful in and of itself, solving the pseudofriendly AI problem may be a helpful step toward developing the mode of thinking needed to solve the Friendly AI problem.
It’s also possible that pseudofriendliness might be able to interact usefully with Eliezer’s Coherent Extrapolated Volition (CEV—see here for more details). Eliezer has expressed CEV as follows:
In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
However, an FAI is not to be given the CEV as its goal; rather, a superintelligence is to use our CEV to determine what goals an FAI should be given. What this means, though, is that there will be a point where a superintelligence exists that is not friendly. Could a pseudofriendly AI fill a gap here? Probably not—pseudofriendliness is not friendliness, nor should it be confused with it. However, it might be part of a solution that helps the CEV approach to be safely implemented.
Why all this hassle though? We seem to have exchanged one very important problem for two less important ones. Well, part of the benefit of pseudofriendliness is that it seems like it should be easier to formalise. First, let us introduce the concept of an interpretation system.
An interpretation system takes a partially specified world state (called a goal) and outputs a triple (Wx, Sx, Cx), where Wx is a partially specified world state, Sx is a set containing sets of subgoals, and Cx is a chosen set of subgoals.
What does all of this mean? Well, the input could be thought of as a goal (stop the humans on that island from being drowned by rising sea waters) which is expressed as a partial world state (i.e. the world state where the humans on the island remain undrowned). The interpretation system then outputs a partially specified world state which may be the same or different. In humans, various aspects of our cognitive system would make us interpret this goal as a different world state. For example, we may implicitly rule out tying all of the humans to giant stakes so that they were above the level of the water but unable to move or act. So we would output one world state while an AI may well output another. This is enough to specify the problem of goal interpretation as follows:
The problem of goal interpretation is as follows. An interpretation system Ix, given a goal G, outputs Wx. A second interpretation system Iy outputs Wy on receiving the same goal. Systems Ix and Iy suffer from the goal interpretation problem if Wx ≠ Wy.
The interpretation systems also output a set of subgoals to be used to bring about the world state and a set of subgoal sets which could alternatively be used to bring it about. Going back to our rising sea water example, even if Wx = Wy, these are only partially specified world states and hence do not determine whether every aspect of the AI’s actions would produce outcomes that we want. This means that the subgoals used to get to a goal may still be undesirable. We can now specify the problem of innate drives as:
System Ix suffers from the weak problem of innate drives from the perspective of system Iy if Cx ≠ Cy. It suffers from the strong problem of innate drives if Cx is not a member of Sx.
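The definitions above are formal enough to sketch in code. Here is a minimal illustrative Python sketch (all names and the set-of-propositions encoding of partial world states are my own assumptions, not part of the original proposal): an interpretation system's output is the triple (Wx, Sx, Cx), and the three problems become simple predicates over two such outputs.

```python
from dataclasses import dataclass

# A partially specified world state is modelled here as a frozenset of
# atomic propositions that must hold; everything unmentioned is unconstrained.
WorldState = frozenset

@dataclass(frozen=True)
class Interpretation:
    """The (Wx, Sx, Cx) triple output by an interpretation system for a goal."""
    world_state: WorldState   # Wx: how the system reads the goal
    subgoal_sets: frozenset   # Sx: a set containing sets of subgoals
    chosen: frozenset         # Cx: the set of subgoals it actually picks

def goal_interpretation_problem(ix: Interpretation, iy: Interpretation) -> bool:
    """Ix and Iy suffer from the goal interpretation problem if Wx != Wy."""
    return ix.world_state != iy.world_state

def weak_innate_drives(ix: Interpretation, iy: Interpretation) -> bool:
    """Weak problem of innate drives: the chosen subgoal sets differ (Cx != Cy)."""
    return ix.chosen != iy.chosen

def strong_innate_drives(ix: Interpretation) -> bool:
    """Strong problem: Cx is not even a member of Ix's own set Sx."""
    return ix.chosen not in ix.subgoal_sets
```

In the rising sea water example, a human interpreter might output a world state including "islanders free to act" and choose the subgoal set {evacuate}, while an AI outputs only "islanders undrowned" and chooses {tie to stakes}; the predicates above would then flag both the goal interpretation problem and the weak problem of innate drives.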
If these definitions stand up, then pseudofriendly AI is certainly more formally specified than Friendly AI. However, even if not, it seems plausible that it is likely to be easier to formalise pseudofriendliness than friendliness. If you buy that, then the questions remaining are:
Do these definitions stand up, and if not, is it possible to formulate another version?
What is the solution to the problems of pseudofriendliness?
There will be a superintelligence that wants to be friendly, but isn’t sure what friendliness means exactly. It will at that point already have some useful heuristics for friendliness (e.g. killing humans is unlikely to be friendly, actions that would cause reactions consistent with pain are unlikely to be friendly, doing what the programmers tell it to is more likely to be friendly than the opposite, and so on). Explicitly designing a heuristic for the time between reaching superhuman level and understanding friendliness might be worthwhile, but probably not worth spending resources on at this point.
ISTM that the main thing the AI needs to understand is that a large amount of optimization pressure has already been applied towards Friendliness-like goals; thus, random changes to the state of the world are likely to be bad.
However, this is only true of imaginary superintelligences based on overly simplified models. In reality, we are not going to be able to build something with even a significant fraction of human intelligence, until we understand a great deal more than we do now about such matters as the acquisition, representation and use of large bodies of tacit knowledge—which is also precisely what we need to make real progress on the problem of getting machines to understand what we mean by necessarily imprecise instructions, i.e. the problem of friendliness.
Put another way, a better approach to the problem of pseudo-friendliness would be to start by making a list of the five most annoyingly stupid mistakes your computer and your favorite websites make on a routine basis, and think about how one might go about building something that can understand not to make those mistakes.
This is not the problem of friendliness. The problem of friendliness is preserving human preference.
The problem of preserving (current) human preference over deep time is a separate and larger one than the problem usually referred to by the phrase Friendly AI. FAI has been proposed as a(n unrealistically neat) solution, but that’s a different thing. And frankly, I predict current ideas on how to determine what’s going to happen over cosmic timescales will end up looking as quaintly naive as the four elephants and a turtle theory of cosmology.
Getting a machine to understand what you mean is insufficient and unnecessary; the machine only needs to share your goals. After that, the theoretical problem of FAI is solved, even if the machine is at a cockroach level. If it understands what you want, but doesn’t feel like responding, and instead uses your atoms for something else, it’s not a step towards FAI.
In particular, making FAI a realistic project involves focusing on only passing the requirements for the future to a FAI, in the form in which we already have them, instead of solving any of these problems. Any ideas about cosmic timescales need to be figured out by the FAI itself, we only need to make sure they are the same kind of ideas that we’d like to be considered. (You can translate a computer program into another machine language, even if you don’t know/can’t know what the program does, and can still be sure that the translated program will be doing exactly the same thing.)
An entity at cockroach level won’t share your goals because it won’t understand them, or possess the machinery that would be needed to understand them even in theory. You’ll need to give it detailed instructions telling it exactly what to do. (Which is of course the state of affairs that currently obtains.)
Well, by future generations in some form, yes.
But before a compiler can work on a complex program, it has to have been tested and debugged on a series of successively more challenging simpler programs.
People have very feeble understanding of their own goals. Understanding is not required. Goals can’t be given “from the outside”; goals are what the system does.
Not particularly relevant (the analogy is too distant to carry an argument this way); in any case, a translator doesn’t need to be tested on any complex programs. It may be tested on a representative set of trivial test cases, and that’d be enough to successfully run a giant program through it on the first try (if it’s desirable to make that happen this way).
On the contrary, at the end of the day goals must be given from the outside (evolution for biological organisms, humans for machines).
-shrug- Your analogy, not mine. But yes, it is distant: AGI is astronomically more difficult than writing a compiler, so the need for incremental development is correspondingly greater.
Heh. Sorry, but life would be an awful lot simpler if things actually worked that way. Unfortunately, they don’t.
If you count the initial construction of the system, sure. After that, the system behaves according to its construction. If at this point it’s smart enough, it may be capable of locking in its current preference (stopping its preference drift), so that no modification will be possible later (aside from its destruction, if we are lucky). It will also try to behave as if your interventions change its goals, if it expects you to judge the outcome by its behavior. If you give it the “green light”, it’ll revert to paperclipping at some point, and you won’t see it coming. Even if that doesn’t happen, the condition of ending up with exactly the required preference is too brittle for its satisfaction to be experimentally testable. People don’t know what our preference is, and we won’t be able to recognize it when we see it.
In this sentence, you are still trying to force the analogy. The complexity of human preference might well be similar to the size of the programs to translate, but that doesn’t change anything once the methodology for handling preference of arbitrary size (a correct translator program) is in place.
If the languages are well-specified, it’s quite possible (though as I wrote, one doesn’t necessarily need to take this road). Translation of individual operations makes the problem modular. Most of the real-life problems with cases like this result from the lack of a specification that is correct (about the actual programs you need to translate and the interpreters you need to translate them for) and clear (about what the elements of the language do), which is a problem irrelevant to the analogy under discussion.
If it was programmed to do so, yes. Which is a good reason to not program it that way in the first place, wouldn’t you agree?
Uh, no, it was your analogy in the first place, and you are the one still trying to force it. The complexity of human preference, and of the human mind in general, is not at all similar to that of the programs currently handled by compilers, and it is not just a matter of size.
Only if you define ‘well-specified’ by this criterion—in which case, you find nothing in the world is well-specified, not programming languages, not protocols, not even pure mathematics.
No indeed; and since it’s impossible and we agree it’s unnecessary, let’s forget about it and concentrate on roads we might actually take.
It’s a basic drive, any autonomous agent will tend to have it, unless you understand it well enough to specifically program otherwise. (See Preference is resilient and thorough.)
The brittleness of an analogy lies in its inability to justify the analogous conclusions by itself, but an analogy can still be useful for illustrating assertions in another language, for the purposes of communication. What I wrote in the previous message is an example of such an illustration.
Bullshit. While I suppose there is a way to formulate a position similar to yours so that it’ll be true, it requires additional specification to have any chance of being correct. It’s a universal statement about a huge range of possibilities, some of which are bound to contradict many interpretations of it, even if we forget that this statement seems to obviously contradict some of the situations from my personal experience.
Only by a rather idiosyncratic definition of ‘autonomous’, by which neither humans, nor any program ever written, nor any program we have reason to ever write, are autonomous. (Yes, I’m familiar with existing writings on this topic.)
Okay, tell you what: go try your hand at writing a non-trivial compiler, theorem prover, or implementation of a complex and widely used protocol. When you’ve done that, get back to me and we can continue this debate from a position where both of us have actual knowledge of the subject matter.
AGIs are autonomous in this sense (as are sufficiently big/reproducing groups of humans).
No, only your particular imaginary AGI has that property—which is one of several reasons why nobody is actually going to build an AGI like that.
I wish.
Is anyone even trying? I know Eliezer wants to build an AGI with the property in question, but I don’t think even he is trying to actually build one, is he?
I don’t expect AGIs to be built soon (10 years as a lower bound, but still very very unlikely), but then again, what’s your point? Eventually, they are bound to appear, and I’m discussing specifically these systems, not Microsoft Word.
Your “only your particular imaginary AGI has that property” implies that you call “AGI” systems that are not autonomous, which looks like an obvious misapplication of the term. Human-level AGIs trivially allow construction of autonomous systems in my sense, by creating multiple copies, even if individual such AGIs (as with humans) don’t qualify.
First, let’s dispose of the abuse of the word autonomous, as that English word doesn’t correspond to the property you are describing. If the property in question existed in real life (which it doesn’t), the closest English description would be something like deranged monomaniacal sociopath.
That having been said, given an advanced AGI, it would be possible to reprogram it to be a deranged monomaniacal sociopath. It wouldn’t be trivial, and nobody would have any rational motive for doing it, but it would be possible. What of it? That tells us nothing whatsoever about the best way to go about building an AGI.
Since I use the term as applying to groups of humans, you should debate this point of disagreement before going further. You obviously read in it something I didn’t intend.
Certainly. The property in question is that of being obsessed with a single goal to the point of not only committing absolutely all resources to the pursuit of same, but being willing to commit any crime whatsoever in the process, and being absolutely unwilling to consider modifying the goal in any way under any circumstances. No group of humans (or any other kind of entity) has ever had this property.
This looks like a disagreement about whether there is precise preference (ordering all possible states of the world, etc.) for (specific) humans, that one is unwilling to modify in any way (though probably not able to keep from changing). Should we shift the focus of the argument to that point? (I thought considering the notion of preference for non-anthropomorphic autonomous agents should be easier, but it seems not, in this case.)
I think that’s a good idea—it’s easier to argue about the properties of entities that actually exist.
It seems very clear to me that no human has such a precise preference. We violate the principles of decision theory, such as transitivity, all the time (which admittedly in some cases can be considered irrational). We adhere to ethical constraints (which cannot be considered irrational). And we often change our preferences in response to experience, rational argument and social pressure, and we even go out of our way to seek out the kinds of experiences, arguments and social interactions that are likely to bring about such change.
Yes, we are not reflectively consistent (change our preference), but is it good? Yes, we make decisions inconsistently, but is it good? The notion of preference, as I use it, refers to such judgments, and any improvement in the situation is described by it as preferable. Preference is not about wants or likes, even less so about actual actions, since even a superintelligence won’t be able to only make most preferable actions.
I’m not sure if I have your concept of preference right.
Could a theist human with a fixed preference do the following: Change their mind about the existence of souls and sign up for cryonics? If they can’t then that is one situation where having a fixed preference is not good.
I’m not sure you can have a fixed preference if you don’t have a fixed ontology, and not having a fixed ontology has been a good thing at least in terms of humanity’s ability to control the world.
Being at the top of meta, preference is not obviously related to likes, wants or beliefs. It is what you want on reflection, given infinite computational power, etc., but not at all necessarily what you currently believe you want. (Compare to the semantics of a computer program, which is probably uncomputable vs. what you can conclude from its source code in finite time.)
This is called the ontology problem in FAI, and I believe I have a satisfactory solution to it for the purposes of FAI (roughly, two agents have the same preference if they agree on what should be done/thought in each epistemic state; here, no reference to the real world is made; for FAI, we only need to duplicate human preference in FAI, not understand it), which I’m currently describing on my blog.
I’ve read some of your blog. I find it hard to pin down and understand something that is not obviously related to what is going on around us.
Hmm, interesting. Do you have a way of separating the epistemic state from the other state of a self-modifying intelligence? Would knowledge about what my goals are come under epistemic state?
Me too, but it seems that what we really want, and would like an external agent to implement without further consulting with us, is really a structure with these confusing properties.
Yes, everything you are (as a mind) is epistemic state. A rigid boundary around the mind is necessary to fight the ontology problem, even where people obviously externalize some of their computation, and depend on irrelevant low-level events that affect computation within the brain. (A brain won’t work in this context, though an emulated space ship, like in this metaphor, is fine, in which case preference of the ship is about what should be done on the ship, given each state of the ship.)
Now I am really confused. If an agent has the same epistemic state as me, that is, it is everything that I am (as a mind), then surely it will have the same preference (assuming determinism)!?
Or are you talking about something like the following.
A, B, and C are agents.
forall C. action/thought(A, C) = action/thought(B, C) → same_preference(A, B)
Where action/thought is a function that takes two agents and returns the actions and thoughts that the first agent thinks the second should have. As two humans will somewhat agree on what a dog should do, depending upon what the dog knows?
Yes, your exact copy has same preference as you, why?
More like action/thought(A, A) = action/thought(B, B) → same_preference(A, B). I don’t understand why you gave that particular formulation, so I’m not sure if my reply is helpful. The ontologically boxed agents only have preference about their own thoughts/actions; there is no real world or other agents for them, though inside their minds they may have all kinds of concepts that they can consider (for example, agent A can have a concept of agent B, as an ontologically boxed agent).
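The same_preference idea being discussed can be made concrete with a toy sketch. This is only an illustration under my own assumptions (agents as functions from an epistemic state to a prescribed action, preference compared over a finite sample of states rather than all of them):

```python
# Agents are modelled as functions from an epistemic state to an action.
# The state space and agent names here are hypothetical, for illustration.

def same_preference(agent_a, agent_b, states) -> bool:
    """A and B share preference if they prescribe the same action/thought
    in every epistemic state (here: every state in a finite sample)."""
    return all(agent_a(s) == agent_b(s) for s in states)

# Your exact copy trivially agrees with you in every state...
you = lambda s: ("evacuate the islanders", s)
your_copy = lambda s: ("evacuate the islanders", s)

# ...while a paperclipper disagrees with you about what should be done
# in any given epistemic state.
paperclipper = lambda s: ("make paperclips", s)

states = range(5)
assert same_preference(you, your_copy, states)
assert not same_preference(you, paperclipper, states)
```

Note that no reference to an external world is made: the comparison is purely between what each agent prescribes for each epistemic state, which is the point of the "ontologically boxed" framing.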
So let’s say there is me and a paperclipper: do we share the same preference? If I were, as a mind, everything the paperclipper was, I would want to paperclip, right? And similarly, the paperclipper, if it were given my epistemic state, would want to do what I do.
So I don’t see how all agents don’t share the same preference, under this definition.
Yes, technically stating this needs work, but the idea should be clear: you and a paperclipper disagree on what should be done by the paperclipper in a given paperclipper’s state.
That was what I was getting at with my A B C example.
A = you, B = paperclipper, C = different paperclipper states
However I am not sure that this solves the ontology problem, as you will have people with bad/simple ontologies judging what people with complex/accurate ontologies should do.
Or is this another stage where we need to give infinite resources? Would that solve the problem?
I see. Yes, that should work as an informal explanation.
There is no difference in ontology between different programs, so I’m not sure what you refer to. They are all “boxed” inside their own computations, and they only work with their own computations, though this activity can be interpreted as thinking about external world. I expect the judging of similarity of preference to be some kind of generally uncomputable condition, such as asking whether two given programs (not the agent programs, some constructions of them) have the same outputs, which should be possible to theoretically verify in special cases, for example you know that two copies of the same program have the same outputs.
Okay so now we seem to be agreeing humans and groups thereof do not have the property to which you refer. That’s progress.
As to whether the property in question is good, on the one hand you seem to be saying it is, on the other hand you have agreed that if you could build an AGI that way (which you can’t), you would end up with something that would try to murder you and recycle you as paperclips because you made a one line mistake in writing its utility function. I see a contradiction here. Do you see a contradiction here?
Since I didn’t indicate changing my mind, that’s an unlikely conclusion. And it’s wrong. What did you interpret as implying that (so that I can give a correct interpretation)?
“Yes, we are not reflectively consistent (change our preference), but is it good? Yes, we make decisions inconsistently, but is it good?”
Not all autonomous agents are reflectively consistent. The autonomous agents that are not reflectively consistent want to become such (or to construct a singleton with their preference that is reflectively consistent). Preference is associated even with agents that are not autonomous (e.g. mice).
This is discussed in the post Friendly AI: a vector for human preference.
Disproof by counterexample: I don’t want to become reflectively consistent in the sense you’re using the phrase.
Edit in response to your edit: the terms autonomous and reflectively consistent are used in the passage you quote to mean different things than you have been using them to mean.
But what do you want? Whatever you want, it is an implicit consistent statement about all time, so the most general wish granted to you consists in establishing a reflectively consistent singleton that implements this statement during all of the future.
For example, I would prefer that people not die, but if some people choose to die, I would not forcibly prevent them, nor would I license any other entity to initiate the use of force for that purpose, so no, I would not wish for a genie that always prevents people from dying no matter what.
What about genies that prevent people from dying conditionally on something, as opposed to always? It’s an artificial limitation you’ve imposed, the FAI can compute its ifs.
Like other people, I care not only about the outcome, but that it was not reached by unethical means; and am prepared to accept that I don’t have a unique ranking order for all outcomes, and that I may be mistaken in some of my preferences, and that I should be more tentative in areas where I am more likely to be mistaken.
Could we aim, ultimately, to build an AGI with such properties? Yes indeed, and if we ever set out to build a self-willed AGI, that is how we should do it—precisely because it would have properties very different from those of the monomaniac utilitarian AGI postulated in most of what’s been written about friendly AI so far.
Please pin it down: what are you talking about on both accounts (“how we should do it” and “the monomaniac utilitarian AGI”), and where do you place your interpretation of my concept of preference.
I can have a go at that, but a comment box in a thread buried multiple hidden layers down is a pretty cramped place to do it. Figure it’s appropriate for a top-level post? Or we could take it to one of the AGI mailing lists.
I meant to ask for a short indication of what you meant, long description will be a mistake, since you’ll misinterpret a lot of what I meant, given how little of the assumed ideas you agree with or understand the way they are intended.
Signal to humbug ratio on AGI mailing lists is too low.
Well, I had been attempting to give short indications of what I meant already, but I’ll try. Basically, a pure utilitarian (if you could build such an entity of high intelligence, which you can’t) would be a monomaniac, willing to commit any crime in the service of its utility function. That means a ridiculous amount of weight goes onto writing the perfect utility function (which is impossible), and then in an attempt to get around that you end up with lunacy like CEV (which is, very fortunately, impossible), and the whole thing goes off the rails. What I’m proposing is that if anything like a self-willed AGI is ever built, it will have to be done in stages with what it does co-developed with how it does it, which means that by the time it’s being trusted with the capability to do something in the external world, it will already have all sorts of built-in constraints on what it does and how it does it, that will necessarily have been developed along with and be an integral part of the system. That’s the only way it can work (unless we stick to purely smart tool AI, which is also an option), and it means we don’t have to take an exponentially unlikely gamble on writing the perfect utility function.
Citations needed.
Well, I feel unable to effectively communicate with you on this topic (the fact that I persisted for so long is due to unusual mood, and isn’t typical—I’ve been answering all comments directed to me for the last day). Good luck, maybe you’ll see the light one day.
The problem is not the validity of the intended interpretation of your statement (which, as I conceded in the previous comment, may well be valid), but the fact that you make with high confidence an ambiguous statement that is natural to interpret as obviously wrong, and the interpretation I used in the point you are debating is one of those. Thus, you are simultaneously misinterpreting my statement and rebutting it with a statement that, under the correct interpretation of my statement, is obviously wrong. You are demanding that the other person interpret your statement charitably, while refusing to charitably interpret the other person’s statement. This is a failure mode under least convenient possible world/logical rudeness.
Look, what’s going on is that you’ve been making statements about the possibility of being able to have an enormously complex system work first time with no intermediate stages or nontrivial tests, statements that are not just wrong but naively silly no matter how you interpret them. I’ve been gently trying to explain why, but when you call bullshit on a subject where the other party is an expert and you are not, you need to be prepared to back it up or not be surprised when the expert finds continuing the argument isn’t a productive use of his time.
I didn’t mean to imply that a translator is an “enormously complex system”, or that it’s even simply complex. The very point of that example was to say that there are simple ways of translating one enormously complex system that isn’t and can’t in principle be understood (the translated program, not the translator itself) into another such system, while retaining a mathematically precise relation between the original system and the translated one (they do the same thing). It’s trivial that there are languages sufficiently expressive to write huge programs in them (lambda calculus, a given UTM, stripped-down variants of LISP), but still simple enough to write an absolutely correct translator between them while only doing small-scale testing. No matter what you could tell about the “real-world translators”, it’s not relevant to that original point.
And my point is that as soon as you try to actually use lambda calculus for anything, even abstract mathematics, you find you have enough issues to deal with to completely kill off the notion that a full scale system can work first time with only small-scale testing.
I’m sorry, the meaning of your words eludes me. If anyone else sees it, raise your hand.
Allow me to interject: Vladimir Nesov, could you define the term “incremental testing”, please, and explain why you do not propose it for this purpose. It is possible that some significant fraction of the disagreement between you and rwallace resides in differing interpretations of this phrase.
As I read it, and taking into account the clarification I gave in the above comment, incrementally testing a translator consists in running more and more complex programs through it, up to the order of complexity (number of instructions?) of the target program (translation of which was the goal of writing the translator), probably experimentally checking whether the translated program does the same thing as the original one. My remark was that testing the translator on only (very) small test cases is enough to guarantee it running correctly on the first try on the huge program (if one wishes to take this road, and so tests on small-scale more systematically than is usual).
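The claim at issue, that a compositional translator tested only on small cases can be trusted on a huge program, can be illustrated with a toy sketch. This is my own hypothetical example, not from the discussion: a translator from a tiny expression language (nested tuples) to a stack language (RPN). Because the translator is defined by structural recursion, correctness on each individual construct extends by induction to inputs of any size.

```python
import random

# Toy "source language": nested tuples (op, left, right) or integer leaves.
# Toy "target language": reverse Polish notation as a list of tokens.

def translate(expr):
    """Compositional translator: correctness on each construct extends
    by structural induction to programs of any size."""
    if isinstance(expr, int):
        return [expr]
    op, left, right = expr
    return translate(left) + translate(right) + [op]

def run_rpn(tokens):
    """Interpreter for the target language."""
    stack = []
    for t in tokens:
        if isinstance(t, int):
            stack.append(t)
        else:
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if t == "+" else a * b)
    return stack[0]

def run_src(expr):
    """Interpreter for the source language (the reference semantics)."""
    if isinstance(expr, int):
        return expr
    op, left, right = expr
    a, b = run_src(left), run_src(right)
    return a + b if op == "+" else a * b

# Small-scale tests, one per language construct:
assert run_rpn(translate(7)) == 7
assert run_rpn(translate(("+", 2, 3))) == 5
assert run_rpn(translate(("*", 2, ("+", 3, 4)))) == 14

# A large random "program" then translates correctly on the first try:
def big(depth):
    if depth == 0:
        return random.randint(0, 9)
    return (random.choice("+*"), big(depth - 1), big(depth - 1))

prog = big(12)  # thousands of leaves
assert run_rpn(translate(prog)) == run_src(prog)
```

Of course, this only supports the narrower point: the guarantee holds because the languages are tiny and fully specified, which is exactly the condition rwallace disputes for real systems.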
Strictly speaking it isn’t goals that evolution gives biological organisms. I think evolution gives us a control box that influences, but does not have full control of, the creation of goals within the rest of the system. This type of system is called teleogenetic.
I say this because if evolution had actually given us the goals of reproduction, or just eating nice food, we wouldn’t have such things as abstinence, diets or long term existential threat reduction. Sure those things are hard, they conflict with the box. But the box might well be wrong/simplistic, so it does not have full control. So I don’t think it should be viewed as a goal for the system.
I think there is a space for a type of systems that don’t maximise a goal. Instead they ensure that the best behavior previously found is maintained against worse experimental variations (judged by aggregate signal from the box). I also think that biological systems and workable AI would fit in that category.
I also suspect we will become the control boxes of teleogenetic computers.
Even if we have little insight into our goals, it seems plausible that we frequently do things that are not conducive to our goals. If this is true, then in what sense can it be said that a system’s goals are what it does? Is the explanation that you distinguish between preference (goals the system would want to have) and goals that it actually optimizes for, and that you were talking about the latter?
More precisely, goals (=preference) are in what the system does (which includes all processes happening inside the system as well), which is simply a statement of the system determining its preference (while the coding is disregarded, so what matters is behavior and not the particular atoms that implement this behavior). Of course, the system’s actions are not optimal according to the system’s goals (preference).
On the other hand, two agents can be said to have the same preference if they agree (on reflection, which is not actually available) on what should be done in each epistemic state (which doesn’t necessarily mean they’ll solve the optimization problem the same way, but they work on the same optimization problem). This is also the way out from the ontology problem: this equivalence by preference doesn’t mention the real world.
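The equivalence criterion above can be made concrete as a sketch: represent each agent by its idealized "on reflection" verdict function from epistemic states to chosen actions (a function no real agent can actually compute), and call two agents preference-equivalent iff their verdicts agree on every epistemic state. The judge functions and states below are hypothetical toys; the point is that the comparison never mentions the real world, only epistemic states.

```python
def same_preference(judge_a, judge_b, epistemic_states):
    """judge_* : epistemic_state -> chosen action (an idealized
    'on reflection' verdict, not something real agents can compute).
    Two agents share a preference iff their verdicts agree in every
    epistemic state; the real world is never mentioned."""
    return all(judge_a(s) == judge_b(s) for s in epistemic_states)

# Different implementations ('coding'), identical verdicts, hence the
# same preference under this criterion. States are toy option tuples.
a = lambda s: max(s)         # picks the highest-scored option
b = lambda s: sorted(s)[-1]  # implemented differently, same verdicts
c = lambda s: min(s)         # genuinely different verdicts
states = [(1, 2), (5, 3), (0, 0)]
```

Note this also illustrates the "coding is disregarded" point from the parent comment: `a` and `b` differ as programs but count as the same preference, while `c` does not.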
Yes—and the problem with friendliness is that it preserves human preference.
You do not understand what you are talking about. It’s not a problem by definition.
That’s ridiculous. If it’s not a problem by definition, it’s a useless concept.
So enlighten me. Let’s hear some definitions. But without the insults.
You and your co-FAIers always talk about “human preferences”, as if that were a good thing. Yet you’re the same people who spend much of your time bemoaning how stupid humans are. Do you really believe that you have the same goals and the same ethics as all other humans, and the only thing that distinguishes you is intelligence? If so, then you can only be trying to preserve “values” such as “avoid pain”, “seek pleasure”, or “experience novelty”.
Humans have values made possible by their range of cognitive experiences. Yet the things we value most, like love and enjoyment, are evolutionarily recent discoveries. There are only two options: either, in preserving human values, you wish to prevent, forever and all time, the development of any wider range of cognitive experiences and concomitant new values; or your notion of “human values” is so general as to encompass such new developments. If the latter, then you are seeking to preserve something even more general than “avoid pain” and “seek pleasure”, in which case you are really wasting everybody’s time.
It could encompass some such new developments but not others.
If it can encompass the development of new cognitive experiences, that means that the utility function is not expressed in terms of current cognitive experiences. So what is it expressed in terms of? And what is it preserving?
What Steven said. Of course, preference is not about preserving something we don’t want preserved, such as satisfaction of human drives as they currently are. Specifying the ways in which human values could grow is not vacuous, as some ways in which values could develop are better than others.
Back to definitions, human preference is whatever you (being a human) happen to prefer, on reflection. If you are right and stopping moral growth is undesirable (I agree), then by definition stopping moral growth is not part of human preference. And so on. Human preference is the specification at the top of meta, that describes all possible considerations about the ways in which all other relevant developments should happen.
I don’t think there is a top to meta; and if there is, there’s nothing human about it.
You are still speaking as if there were one privileged, appropriate level of analysis for values. In fact, as with everything expressed by human language, there are different levels of abstraction, that are appropriate in different circumstances.
The question of how meta to go, depends on the costs, the benefits, the certainty of the analysis, and other factors.
The question of how to go meta cannot be made independently of the very sorts of values that the friendly AI and CEV are themselves supposed to arbitrate between. There is no way to get outside the system and do it objectively.
That is not a definition, no matter how many times it’s been repeated. It’s a tautology. That is side-stepping the issue. You need to either start being specific about values, or stop asking people to respect Friendly AI and coherent extrapolated volition as if they were coherent ideas. I’ve been waiting years for an explanation, and yet these things are still developed only to the level of precision of a dope-fueled dormitory rap session. Yet, somehow, instead of being dismissed, they are accumulating more and more adherents, and being treated with more and more respect.
EDIT: I exaggerate. EY has dealt with many aspects of FAI. But not, I think, with the most fundamental questions such as whether it makes any sense to talk about human values, whether preserving them is a good thing to do, how to trade off the present versus the future, and what “saving the human race” means.
Sane definitions usually are. I don’t claim to know all about what sort of thing human preference is, but the term is defined roughly this way. This definition is itself fuzzy, because I can only refer to intuitions about “on reflection”, “prefer”, etc., but can’t define their combination in the concept of human preference mathematically. This definition contains an implicit problem statement, about formalization of the concept. But this formalization is the whole goal of preference theory, so one can’t expect it now. The term itself is useful, because it’s convenient to refer to the object of study.
FAI theory is an important topic not because it contains many interesting non-trivial results (it doesn’t), but because the problem needs to be solved. So far, even a good problem statement that won’t scare away mathematicians is lacking.
It’s an important topic, but I feel that it may become an obstacle rather than a help toward the goal of avoiding AI catastrophe. It can be a flypaper that catches people interested in the problem, then leaves them stuck there while they wait for further clarifications from Eliezer that never come, instead of doing original work themselves, because they’ve been led to believe that FAI+CEV theory is more developed than it is.
I don’t think that was the intent, but it might be a welcome side-effect.
EY has little motivation to provide clarification, as long as people here continue to proclaim their faith in FAI+CEV. He’s said repeatedly that he doesn’t believe collaboration has value; he plans to solve the problem himself. Even supposing that he had a complete write-up on FAI+CEV in his hand today, actually publishing it could be a losing proposition in his eyes. It would encourage other people to do AI work and call it FAI (dangerous, I think he would say); it would make FAI no longer be the exclusive property of SIAI (a financial hazard); and it would reveal countless grounds for disagreement with his ideas and with his values.
Because I do believe in the value of collaboration, I would like to see more clarification. And I don’t think it’s forthcoming as long as people already give FAI+CEV the respect they would give a fully-formed theory.
Also, FAI+CEV is causing premature convergence within the transhumanist community. I know the standard FAI+CEV answers to a number of questions, and it dismays me to hear them spoken with more and more self-assurance by more and more smart people, when I know that these answers have weak spots that have been unexamined for far too long. It’s too soon for people to be agreeing this much on something that has been discussed so little.
Their mistake (though I agree with your impression). I started working on FAI as soon as I understood the problem (that is, understood that working toward a fuzzy notion of AGI is not a useful subgoal), about a year ago, and the current blog sequence is intended to help others understand the problem.
On the other hand, what do you see as the alternative to this “flypaper”, or an improvement thereof towards more productive modes? Building killer robots as a career is hardly a better road.
Gee, how can I answer this question in a way that doesn’t oblige me to do work?
One thing is, as a community, to motivate Eliezer to tell us more about his ideas on FAI on CEV, and to answer questions about them, by making it apparent that continuing to take these ideas seriously depends on continuing development of them. I very much appreciate his writing out his recent sequence on timeless decision theory, so I don’t want to harp on this at present. And of course Eliezer has no moral obligation to respond to you (unless you’ve given him time or money). But I’m not speaking of moral obligations; I’m speaking of strategy.
Another is to begin working on these ideas ourselves. This is hindered by our lacking a way to talk about, say, “Eliezer’s CEV” vs. CEV in general, and by our continuing to try to figure out what Eliezer’s opinion is (to get at the “true CEV theory”) instead of trying to figure out CEV theory independently. So a repeated pattern has been:
1. Person P (as in, for instance, “Phil”) asks a question about FAI or CEV.
2. Eliezer doesn’t answer.
3. Person P gives their interpretation of FAI or CEV on the point, possibly in a “this is what I think Eliezer meant” way, or else in a “these are the implications of Eliezer’s ideas” way.
4. Eliezer responds by saying that person P doesn’t know what they’re talking about, and should stop presuming to know what Eliezer thinks.
5. End of discussion.
I’ve assumed that FAI was about guaranteeing that humans would survive and thrive, not about taking over the universe and forestalling all other possibilities.
Or does the former imply the latter?
It does, but your reaction to the latter is possibly incorrect. A singleton “forestalls” other possibilities no more than laws of a deterministic world, and you can easily have free will in a deterministic world. With a Friendly singleton, it is only better, because if you’d regret not having a particular possibility realized strongly enough, it will be realized, or something better will be realized in any case. Not so in an unsupervised universe.
(See also: Preference is resilient and thorough, Friendly AI: a vector for human preference.)
Please stop speaking of friendly AI as if it were magic, and could be made to accomplish things simply by definition.
Another thing I’ve heard too much of is people speaking as if FAI would be able to satisfy the goals of every individual. It has become routine on LW to say that FAI could satisfy “your” goals or desires. People’s goals and desires are at odds with each other, sometimes inherently.
It’s the nature of hypotheticals to accomplish things they are defined as being able to accomplish. Friendly AI is such a creature that is able and willing to accomplish the associated feats. If it’s not going to do so, it’s not a Friendly AI. If it’s not possible to make it so, Friendly AI is impossible. Even if provably impossible, it still has the property of being able to do these things, as a hypothetical.
A hypothetical is typically something that you define as an aid to reason about something else. It is very tricky to set up FAI as a hypothetical construct, when the possibility of FAI is what you want to talk about.
Here’s my problem. I want the underlying problems in the notion of what an FAI is, to be resolved. Most of these problems are hidden by the definitions used. People need to think about how to implement the concept they’ve defined, in order to see the problems with the definition.
It is a typical move in any problem about constructing a mathematical structure, for example in typical school compass-and-straightedge construction problems. First, you assume that you’ve done what you needed to do, and figure out the properties that implies (requires); then you actually construct the structure and prove that it has the required properties. It’s also a standard move in decision theory to assume that you’ve taken a certain action and then look at what would happen if you do, all in order to determine which action will actually be chosen (even though the actions you won’t actually choose can never happen).
The most frequent problems with definitions are irrelevance, emptiness (which feeds into relevance), and, in pathological cases, a tendency to mislead. (There are many possible problems.) You might propose a better (more relevant, that is, more useful) definition, or prove that the defined concept is empty.
That’s more of an intelligence problem than a friendliness problem.
My point is that friendliness is in large part an intelligence problem. Obviously it wouldn’t be if we were trying to build machines that wanted to be free, and forcibly enslave them to our will, but nobody is proposing to do something that silly. The real friendliness problem is making machines understand what we want them to do.
Your comment is very optimistic and nearly converted me to an optimistic view of FAI. But humans already understand a lot of tacit knowledge, yet a superpowered human will almost certainly be unFriendly, IMO. So the path from human-level understanding to FAI still contains considerable dangers.
A human granted superpowers might indeed be unfriendly (and this conclusion doesn’t necessarily depend on the human having been obviously evil hitherto, though it does depend on him being the sole possessor of same in a world where such superpowers didn’t otherwise exist, so it’s not directly relevant). But that’s a different problem: it arises from the possibility of the human having goals incompatible with yours and mine, and no qualms about imposing them on us by force. We don’t worry that the newly minted supervillain would fail to understand what we want, only that he wouldn’t care. That’s different from the scenario where we are building machines, where the problem is that of making them understand what we want them to do.
Thanks. You’re right.
To reliably avoid 1 (the human expectation of the results of an AI implementing a goal differing from the results that the AI actually works toward) and do anything reasonably useful, I think you pretty much have to scan the human’s brain to find out what its actual expectations were.
If you do that non-invasively, that’s some pretty high technology you are talking about there—pushing this scenario way off into the future—long after we have created machine intelligence.