Thoughts on Iason Gabriel’s Artificial Intelligence, Values, and Alignment
Iason Gabriel’s 2020 article Artificial Intelligence, Values, and Alignment offers a philosophical perspective on what the goal of alignment actually is, and how we might accomplish it. In the best spirit of modern philosophy, it provides a helpful framework for organizing what has already been written about the levels at which we might align AI systems, and it draws a neat set of connections between concepts in AI alignment and concepts in modern philosophy.
Goals of alignment
Gabriel identifies six levels at which we might define what it means to align AI with something:
- Instructions: the agent does what I instruct it to do.
- Expressed intentions: the agent does what I intend it to do.
- Revealed preferences: the agent does what my behaviour reveals I prefer.
- Informed preferences or desires: the agent does what I would want it to do if I were rational and informed.
- Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking.
- Values: the agent does what it morally ought to do, as defined by the individual or society.
Schemas like this are helpful because they can “pop us out” from our unexamined paradigms. If we are, for example, having a discussion about building AI from inside the “revealed preferences” paradigm, it is good to know that we are having a discussion from inside that paradigm. It is a virtue of modern philosophy to always be asking what unexamined paradigm we are inside, and to be pushing us to at least see that we are inside such-and-such a paradigm, so that we can examine it and decide whether to keep working within it.
In that spirit, I would like to offer a conceptualization of the paradigm that I think all six of these levels are within, in order that we might examine that paradigm and decide whether we wish to keep working within it. It seems to me that we presuppose that when we deploy AI, we will pass agency away from humans and into the hands of the AI, at least for as long as it takes the AI to execute our instructions, intentions, preferences, interest, or values. We imagine our future AIs as assistants, genies, or agents with which we are going to have some initial period of contact, followed by a period during which these powerful agents go off and do our bidding, followed perhaps by later iterations in which these agents come back for further instructions, intentions, preferences, interests, or values. We understandably find it troubling to consider turning so much of our agency over to an external entity, yet most work in AI alignment is about how to safely navigate this hand-off of agency, and it seems to me that there is relatively little discussion of whether we should be doing all of our thinking on the assumption of a hand-off of agency. I will call this paradigm that I think we are inside the Agency Hand-off Paradigm:
[Figure: The Agency Hand-off Paradigm. An arrow runs from “Human” to “AI”, representing the hand-off of instructions, intentions, preferences, interests, or values.]
A few brief notes on how this relates to existing work in AI alignment:
- The AI alignment sub-field of corrigibility is concerned with the design of AI that we can at least switch off if we later regret the instructions, intentions, preferences, interests, or values that we gave it. This is of course a good property for AI to have if we are going to hand off agency to it, but we seem to be inside the Agency Hand-off Paradigm almost by default.
- Stuart Russell’s work on assistance games is about transmitting instructions / intentions / preferences / interests / values from humans to AIs as an ongoing dialog rather than a one-shot up-front data dump, but this work still assumes that agency is going to be handed over to our AIs; it’s just that the arrow from “Human” to “AI” in the figure above becomes a sequence of arrows.
- Eliezer Yudkowsky’s writing on coherent extrapolated volition and Paul Christiano’s writing on indirect normativity are both concerned with extracting values from humans in a way that bypasses our limited ability to articulate our own values. Yet both bodies of work presuppose that there is going to be some phase during which we extract values from humans, followed by a phase during which our AIs take actions on the basis of these values. Under this assumption we indeed ought to be very concerned about getting the value-extraction step right, since the whole future of the world hangs on it.
- Significant portions of Nick Bostrom’s book Superintelligence were concerned with the dangers of open-ended optimization over the world. It seems to me that the basic reason to be concerned about powerful optimizers in the first place is that they are precisely the category of system that takes agency away from humans.
But perhaps there is room to question the Agency Hand-off Paradigm. I would very much like to see proposals for AI alignment that escape completely from the assumption that we are going to hand off agency to AI. What would it look like to have powerful intelligent systems that increased rather than decreased the extent to which humans have agency over the future?
Think of a child playing in a sand pit. The child’s parent has constructed the sand pit for the child and will keep the child safe. If the child happens to find, say, a shard of glass, then the parent may take it away. But for the most part the parent will just let the child play and learn and grow. It would be a little strange to think of the parent as taking instructions, intentions, preferences, interests, or values from the child and then assuming agency over the arrangement of sand in the sand pit on that basis. Yes, the parent has a sense of what is in the child’s best interests when taking away the shard of glass, but not because the parent understands the child’s intentions for how all the sand should ultimately be arranged and is accelerating things in that direction; rather, the shard of glass threatens the child’s own agency in a way that the child cannot account for in the short term. In the long term the parent will help the child to grow in such a way that they will be able to safely handle sharp objects on their own, and the parent will eventually fade away from the child’s life completely. The long-run flow of agency is towards the child, not towards the parent. Is it not possible that we could build AI that ensures that agency flows towards us, not away from us, over the long run?
I basically agree that humans ought to use AI to get space, safety and time to figure out what we want and grow into the people we want to be before making important decisions. This is (roughly) why I’m not concerned with some of the distinctions Gabriel raises, or that naturally come to mind when many people think of alignment.
That said, I feel your analogy misses a key point: while the child is playing in their sandbox, other stuff is happening in the world—people are building factories and armies, fighting wars and grabbing resources in space, and so on—and the child will inherit nothing at all unless their parent fights for it.
So without (fairly extreme) coordination, we need to figure out how to have the parent acquire resources and then ultimately “give” those resources to the child. It feels like that problem shouldn’t be much harder than the parent acquiring resources for themselves (I explore this intuition some in this post on the “strategy stealing” assumption), so that this just comes down to whether we can create a parent who is competent while being motivated to even try to help the child. That’s what I have in mind while working on the alignment problem.
On the other hand, given strong enough coordination that the parent doesn’t have to fight for their child, I think that the whole shape of the alignment problem changes in more profound ways.
I think that much existing research on alignment, and my research in particular, is embedded in the “agency hand-off paradigm” only to the extent that is necessitated by that situation.
I do agree that my post on indirect normativity is embedded in a stronger version of the agency hand-off paradigm. I think the main reason for taking an approach like that is that a human embedded in the physical world is a soft target for would-be attackers. If we are happy handing off control to a hypothetical version of ourselves in the imagination of our AI, then we can achieve additional security by doing so, and this may be more appealing than other mechanisms to achieve a similar level of security (like uploading or retreating to a secure physical sanctuary). In some sense all of this is just about saying what it means to ultimately “give” the resources to the child, and it does so by trying to construct an ideal environment for them to become wiser, after which they will be mature enough to provide more direct instructions. (But in practice I think that these proposals may involve a jarring transition that could be avoided by using a physical sanctuary instead, or just by ensuring that our local environments remain hospitable.)
Overall it feels to me like you are coming from a similar place to where I was when I wrote this post on corrigibility, and I’m curious if there are places where you would part ways with that perspective (given the consideration I raised in this comment).
(I do think “aligned with who?” is a real question since the parent needs to decide which child will ultimately get the resources, or else if there are multiple children playing together then it matters a lot how the parent’s decisions shape the environment that will ultimately aggregate their preferences.)
Microscope AI (see here and here) is an AI alignment proposal that attempts to entirely avoid agency hand-off.
I also agree with Rohin’s comment that Paul-style corrigibility is at least trying to avoid a full agency hand-off, though it still has significantly more of an agency hand-off than something like microscope AI.
Thanks for this!
Np! Also, just going through the rest of the proposals in my 11 proposals paper, I’m realizing that a lot of the other proposals also try to avoid a full agency hand-off. STEM AI restricts the AI’s agency to just STEM problems, narrow reward modeling restricts individual AIs to only apply their agency to narrow domains, and the amplification and debate proposals are trying to build corrigible question-answering systems rather than do a full agency hand-off.
Huh. I see a lot of the work I’m excited about as trying to avoid the Agency Hand-Off paradigm, and the Value Learning sequence was all about why we might hope to avoid it.
Your definition of corrigibility is the MIRI version. If you instead go by Paul’s post, the introduction goes:
This seems pretty clearly outside of the Agency Hand-Off paradigm to me (that’s the way I interpret it at least). Similarly for e.g. approval-directed agents.
I do agree that assistance games look like they’re within the Agency Hand-Off paradigm, especially the way Stuart talks about them; this is one of the main reasons I’m more excited about corrigibility.
Um, bad?
Humans aren’t fit to run the world, and there’s no reason to think humans can ever be fit to run the world. Not unless you deliberately modify them to the point where the word “human” becomes unreasonable.
The upside of AI depends on restricting human agency just as much as the downside does.
You seem to be relying on the idea that someday nobody will need to protect that child from a piece of glass, because the child’s agency will have been perfected. Someday the adult will be able to take off all the restraints, stop trying to restrict the child’s actions at all, and treat the child as what we might call “sovereign”.
… but the example of the child is inapt. A child will grow up. The average child will become as capable of making good decisions as the average adult. In time, any particular child will probably get better than any particular adult, because the adult will be first to age to the point of real impairment.
The idea that a child will grow up is not a hope or a wish; it’s a factual prediction based on a great deal of experience. There’s a well-supported model of why a child is the way a child is and what will happen next.
On the other hand, the idea that adult humans can be made “better agents”, whether in the minimum, the maximum, or the mean, is a lot more like a wish. There’s just no reason to believe that. Humans have been talking about the need to get wiser for as long as there have been records, and have little to show for it. What changes there have been in individual human action are arguably due more to better material conditions than to any improved ability to act correctly.
Humans may have improved their collective action. You might have a case to claim that governments, institutions, and “societies” take better actions than they did in the past. I’m not saying they actually do, but maybe you could make an argument for it. It still wouldn’t matter. Governments, institutions and “societies” are not humans. They’re instrumental constructs, just like you might hope an AI would be. A government has no more personality or value than a machine.
Actual, individual humans still have not improved. And even if they can improve, there’s no reason to think that they could ever improve so much that an AI, or even an institution, could properly take all restraints off of them. At least not if you take radical mind surgery off the table as a path to “improvement”.
Adult humans aren’t truly sovereign right now. You have comparatively wide freedom of action as an adult, but there are things that you won’t be allowed to do. There are even processes for deciding that you’re defective in your ability to exercise your agency properly, and for taking you back to childlike status.
The collective institutions spend a huge amount of time actively reducing and restricting the agency of real humans, and a bunch more time trying to modify the motivations and decision processes underlying that agency. They’ve always done that, and they don’t show any signs of stopping. In fact, they seem to be doing it more than they did in the past.
Institutions may have fine-tuned how they restrict individual agency. They may have managed to do it more when it helps and less when it hurts. But they haven’t given it up. Institutions don’t make individual adults sovereign, not even over themselves and definitely not in any matter that affects others.
It doesn’t seem plausible that institutions could keep improving outcomes if they did make individuals completely sovereign. So if you’ve seen any collective gains in the past, those gains have relied on constructed, non-human entities taking agency away from actual humans.
In fact, if your actions look threatening enough, even other individuals will try to restrain you, regardless of the institutions. None of us is willing to tolerate just anything that another human might decide to do, especially not if the effects extend beyond that person.
If you change the agent with the “upper hand” from an institution to an AI, there’s no clear reason to think that the basic rules change. An AI might have enough information, or enough raw power, to make it safe to allow humans more individual leeway than they have under existing institutions… but an AI can’t get away with making you totally sovereign any more than an institution can, or any more than another individual can. Not unless “making you sovereign” is itself the AI’s absolute, overriding goal… in which case it shouldn’t be waiting around to “improve” you before doing so.
There’s no point at which an AI with a practical goal system can tell anything recognizably human, “OK, you’ve grown up, so I won’t interfere if you want to destroy the world, make life miserable for your peers, or whatever”.
As for giving control to humans collectively, I don’t think it’s believable that institutions could improve to the point where a really powerful and intelligent AI could believe that those institutions would achieve better outcomes for actual humans than the AI could achieve itself. Not on any metric, including the amount of personal agency that could be granted to each individual. The AI is likely to expect to outperform the institutions, because the AI likely would outperform the institutions. Ceding control to humans collectively would just mean humans individually losing more agency… and more of other good stuff, too.
So if you’re the AI, and you want to do right by humans, then I think you’re going to have to stay in the saddle. Maybe you can back way, way off if some human self-modifies to become your peer, or your superior… but I don’t think that critter you back off from is going to be “human” any more.
I see this argument pop up every so often. I don’t find it persuasive because it presents a false choice in my view.
Our choice is not between having humans run the world and having a benevolent god run the world. Our choice is between having humans run the world, and having humans delegate the running of the world to something else (which is kind of just an indirect way of running the world).
If you think the alignment problem is hard, you probably believe that humans can’t be trusted to delegate to an AI, which means we are left with either having humans run the world (something humans can’t be trusted to do) or having humans build an AI to run the world (also something humans can’t be trusted to do).
The best path, in my view, is to pick and choose in order to make the overall task as easy as possible. If we’re having a hard time thinking of how to align an AI for a particular situation, add more human control. If we think humans are incompetent or untrustworthy in some particular circumstance, delegate to the AI in that circumstance.
It’s not obvious to me that becoming wiser is difficult—your comment is light on supporting evidence, violence seems less frequent nowadays, and it seems possible to me that becoming wiser is merely unincentivized, not difficult. (BTW, this is related to the question of how effective rationality training is.)
However, again, I see a false choice. We don’t have flawless computerized wisdom at the touch of a button. The alignment problem remains unsolved. What we do have are various exotic proposals for computerized wisdom (coherent extrapolated volition, indirect normativity) which are very difficult to test. Again, insofar as you believe the problem of aligning AIs with human values is hard, you should be pessimistic about these proposals working, and (relatively) eager to shift responsibility to systems we are more familiar with (biological humans).
Let’s take coherent extrapolated volition. We could try & specify some kind of exotic virtual environment where the AI can simulate idealized humans and observe their values… or we could become idealized humans. Given the knowledge of how to create a superintelligent AI, the second approach seems more robust to me. Both approaches require us to nail down what we mean by an “idealized human”, but the second approach does not include the added complication+difficulty of specifying a virtual environment, and has a flesh and blood “human in the loop” observing the process at every step, able to course correct if things seem to be going wrong.
The best overall approach might be a committee of ordinary humans, morally enhanced humans, and morally enhanced ems of some sort, where the AI only acts when all three parties agree on something (perhaps also preventing the parties from manipulating each other somehow). But anyway...
You talk about the influence of better material conditions and institutions. Fine, have the AI improve our material conditions and design better institutions. Again I see a false choice between outcomes achieved by institutions and outcomes achieved by a hypothetical aligned AI which doesn’t exist. Insofar as you think alignment is hard, you should be eager to make an AI less load-bearing and institutions more load-bearing.
Maybe we can have an “institutional singularity” where we have our AI generate a bunch of proposals for institutions, then we have our most trusted institution choose from amongst those proposals, we build the institution as proposed, then have that institution choose from amongst a new batch of institution proposals until we reach a fixed point. A little exotic, but I think I’ve got one foot on terra firma.
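To make the shape of that loop concrete, here is a minimal sketch in Python; every name in it (Institution, propose, choose, institutional_singularity) is hypothetical, invented for illustration rather than taken from any concrete proposal:

```python
# Hypothetical sketch of the "institutional singularity" loop described above.
# All names are invented for illustration; none of this is a real API.
from dataclasses import dataclass

@dataclass(frozen=True)
class Institution:
    design: str  # stand-in for a full institutional design

def institutional_singularity(propose, choose, initial, max_rounds=100):
    """Iterate: the AI proposes candidate institutions, the currently most
    trusted institution chooses among them, and we repeat until the choice
    stops changing (a fixed point) or we hit a round limit."""
    trusted = initial
    for _ in range(max_rounds):
        proposals = propose(trusted)         # AI generates candidate designs
        chosen = choose(trusted, proposals)  # trusted institution selects one
        if chosen == trusted:                # fixed point: adopt and stop
            break
        trusted = chosen                     # "build" and adopt the new institution
    return trusted

# Toy usage: the AI refines the design a little each round, refinement
# saturates, and the process converges to a fixed point.
propose = lambda inst: [inst, Institution((inst.design + "+")[:4])]
choose = lambda inst, props: max(props, key=lambda p: len(p.design))
print(institutional_singularity(propose, choose, Institution("v0")))
```

The round limit is there because nothing guarantees the selection process converges; whether a real version of this loop has a sensible fixed point is exactly the open question.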
Right, I agree that having a benevolent god run the world is not within our choice set.
Well, just to re-state the suggestion in my original post: is this dichotomy between humans running the world and something else running the world really so inescapable? The child in the sand pit does not really run the world, and in an important way the parent also does not run the world—certainly not from the perspective of the child’s whole-life trajectory.
I buy into the delegation framing, but I think that the best targets for delegation look more like “slightly older and wiser versions of ourselves with slightly more space” (who can themselves make decisions about whether to delegate to something more alien). In the sand-pit example, if the child opted into that arrangement then I would say they have effectively delegated to a version of themselves who is slightly constrained and shaped by the supervision of the adult. (But in the present situation, the most important thing is that the parent protects them from the outside world while they have time to grow.)
I want to start my reply by saying that I am dubious that the best future for humanity is one in which a super-intelligence we build ends up giving all control and decision-making to humans. However, the tone of the post feels somewhat too anti-human (treating a future where humans have greater agency as necessarily “bad”, not just sub-optimal) and too narrow in its interpretation for me to move on without comment. There is a lot to be learned from considering the necessary conflict between human and FAI agency. Yes, conflict.
The first point I don’t fully agree with is the claim that humans lack the capacity to change or grow, even as adults. You cite the lack of growth in wisdom in the human population even though people have been calling for it for millennia. There are many possible reasons for this besides humans being incapable of growth. For one, human psychology has only been significantly understood for about a century, let alone studied in detail with the instruments we have in modern times. One of the greatest difficulties of passing down wisdom is the act of teaching: effective teaching usually has to be personal, tailored to an individual’s current state of mind, background knowledge, and skills. Even so, twin studies have found even such a rough measure as IQ to be 70% genetic and a whopping 30% environmental—and that is within an environment massively worse than perfect at teaching.
Further, practices found within, say, Buddhism show potential for increasing one’s capacity to act on abstract empathy. Jainism, as a religion, seems to have teachings and practices strong enough to make its followers one of the most peaceful groups on the planet, without material conditions being a necessary cause. These are within even current humans’ capacities to achieve. I will also point out the potential that psychedelic substances have to let adults who are set in their ways break out of them, though the research is still relatively new and unexplored. I absolutely agree that human competence will never be able to compete with powerful AI competence, but that’s not really the point. Human values (probably) allow for non-Homo Sapiens to evolve out of—or be created by—”regular” old humans.
This is a good point to move on to institutions, and why human-led institutions tend to be so stifling. There are a few points to consider here, and I’d like to start with centralization versus decentralization. In the case of centralization, yes, we do have a situation where people are essentially absorbed into a whole which acts much like a separate entity in its own right. However, with greater decentralization, where higher organizations are composed of collections of smaller organizations, which are composed of yet smaller orgs, and so on down to the individual, individuals generally have greater agency. Of course, they must still participate in a game with other agents—but then we get into the discussion of whether an agent has more or less agency alone in the wild with no assistance, versus working with other agents towards collective, shared values. I can’t even try to answer this in a reply (or in a book), but the point is that this isn’t cut and dried.
We should also consider the fact that current institutions are set up competitively, with competition not only between institutions but also between the individuals within them. This of course leads to abundant multipolar traps, with serious implications for the capacity of those institutions to be maximally effective. I think the question of whether institutions must necessarily act like this is an open one—and a problem we didn’t even know how to speak about until recently. I am reminded of Daniel Schmachtenberger and “Game B”.
Finally, to the point about conflict. I’d like to consider instances where it makes sense for a Friendly AI to want to increase a human’s agency. To be clear, there is a vital difference between “increasing” a quantity and “maximizing” a quantity. A sufficiently Friendly AI should be able to recognize easily that human values need to be balanced against one another, and that no single value will preclude all the others. With that said, agency is valuable to humans at least in the following cases:
i) Humans intrinsically valuing autonomy. Whether or not you believe we do, if it is true then a Friendly agent would wish to increase autonomy up to a sensible point.
ii) Human values may be computationally intractable, especially when discussing multiple humans. There is (as far as we know) a finite amount of accessible information in the universe. No matter how smart an AI gets, there are probably going to be limits on its ability to compute the best possible world. A human seems to be, in many senses, “closer” to their own values than an outsider looking in is. It is not necessarily true that a human’s actions and desires are perfectly simulatable—we are made of quantum objects, after all, and even with a theory of everything, that doesn’t mean literally anything is possible. It may be that it is more efficient to teach humans how to achieve their own happiness than for the AI to do everything for us, at least in cases of, say, fetching a banana from the table.
iii) An AI may wish to give its (potential) computational resources to humans, in order to increase their capacity to fulfill their values. This is really more of an edge case, but… there will surely be a point where something like “devoting x planet to y computronium versus having z human societies/persons” will need to be decided on. That is to say, at some point an agent will have to decide how many resources it is going to spend predicting human values and the path towards them, versus having humans actually instantiate the appreciation of those values. If we’re talking about maximizing human values… would it not make sense for as much of the universe as possible to be devoted to such things? Consider the paper clip maker which, at the end of its journey of making the universe worthless, finally turns itself into paper clips just to put a cherry on top and leave a universe unable to even appreciate its paper-clip-ness. Similarly, the FAI may want to back off at the extremes to increase the capacity for humans to enjoy their lives.
In terms of formalizing stuff of this nature, I am aware of at least one attempt in information theory with the concept of “empowerment”, which is a measure of the maximum possible mutual information between an agent’s potential future actions and its potential future state. It may be something to look into, though I don’t think it’s perfect.
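Concretely, the usual definition (my rough recollection of Klyubin, Polani, and Nehaniv’s formulation, not something the comment above spells out) is the channel capacity from a sequence of n actions to the resulting state:

$$\mathfrak{E}_n(s_t) \;=\; \max_{p(a_t,\ldots,a_{t+n-1})} I\big(A_t,\ldots,A_{t+n-1};\, S_{t+n} \,\big|\, s_t\big)$$

where the maximum is taken over distributions of n-step action sequences, matching the informal description above of the maximum possible mutual information between potential future actions and the potential future state.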
Sorry for the length, but I hope it was at least thought-provoking.
I’d be concerned that our instincts toward vengeance in particular would result in extremely poor outcomes if you give humans near-unlimited power (which is mostly granted by being put in charge of an otherwise-sovereign AGI); one potential example is the AGI-controller sending a murderer to an artificial, semi-eternal version of Hell as punishment for his crimes. I believe there’s a Black Mirror episode exploring this. In a hypothetical AGI good outcome, this cannot occur.
The idea of a committee of ordinary humans, ems, and semi-aligned AI which are required to agree in order to perform actions does, in principle, avoid this failure mode. Though it’s worth pointing out that this also requires AI alignment to be solved—if the AGI is effectively a whatever-maximizer with the requirement that it gets agreement from the ems and humans before acting, it will acquire that agreement by whatever means, which brings up the question of whether the participation of the humans and ems is at all meaningful in this scenario.
The obvious retort is that the AGI would be a tool-ai without the desire to maximize any specific property of the world beyond the degree to which it obeys the humans and ems in the loop; the usual objections to the idea of tool-ais as an AI alignment solution apply here.
Another possible retort, which you bring up, is that the AGI needs to (instead of maximizing a specific property of the world) understand the balance between various competing human values, which is (as I understand it) another way of saying it needs to be capable of understanding and implementing coherent extrapolated volition. Which is fine, though if you posit this then CEV simply becomes the value to maximize instead; which brings us back to the objection that if it wants to maximize a thing conditional on getting consent from a human, it will figure out a way to do that. Which turns “AI governs the world, in a fashion overseen by humans and ems” into “AI governs the world.”
Thank you for this, jbash.
My short response is: Yes, it would be very bad for present-day humanity to have more power than it currently does, since its current level of power is far out of proportion to its level of wisdom and compassion. But it seems to me that there are a small number of humans on this planet who have moved some way in the direction of being fit to run the world, and in time, more humans could move in this direction, and could move further. I would like to build the kind of AI that creates a safe container for movement in this direction, and then fades away as humans in fact become fit to run the world, however long or short that takes. If it turns out not to be possible then the AI should never fade away.
I think what it means to grow up is to not want to destroy the world or make life miserable for one’s peers. I do not think that most biological full-grown humans today have “grown up”.
Well just to state my position on this without arguing for it: my sense is that institutions should make individual adults sovereign if and when they grow up in the sense I’ve alluded to above. Very few currently living humans meet this bar in my opinion. In particular, I do not think that I meet this bar.
True, but whether or not the word “human” is a reasonable description of what a person becomes when they become fit to run the world, the question is really: can humans become fit to run the world, and should they? Based on the few individuals I’ve spent time with who seem, in my estimation, to have moved some way in the direction of being fit to run the world, I’d say: yes and yes.
If the humans in the container succeed in becoming wiser, then hopefully it is wiser for us to leave this decision up to them than to preemptively make it now (and so I think the situation is even better than it sounds superficially).
It seems like the real thing up for debate will be about power struggles amongst humans—if we had just one human, then it seems to me like the grandparent’s position would be straightforwardly incoherent. This includes, in particular, competing views about what kind of structure we should use to govern ourselves in the future.
I’d definitely agree with this. Human institutions are very bad at making a lot of extremely crucial decisions; the Stanislav Petrov story, the Holocaust, and the prison system are all pretty good examples of cases where the institutions humans have created have (1) been invested with a ton of power, human and technological, and (2) made really terrible decisions with that power which either could have or did cause untold suffering.
Which I guess is mostly a longer way of saying +1.