Upvoted, though I do disagree with the (framework behind the) post.
Here’s a caricature of what I think is your view of AI alignment:
We will build powerful / superintelligent AI systems at some point in the future.
These AI systems will be optimizing for something. Since they are superintelligent, they will optimize very well for that thing, probably to the exclusion of all else.
Thus, it is extremely important that we make this “something” that they are optimizing for the right thing. In particular, we need to define some notion of “the good” that the AI system should optimize for.
Under this setting, alignment researchers should really be very careful about getting the right notion of “the good”, and should be appropriately modest and conservative around it, which in turn implies that we need pluralism.
I basically agree that under this view of the problem, the field as a whole is quite parochial and would benefit from plurality.
----
In contrast, I would think of AI alignment this way:
There will be some tasks, like “suggest good policies to handle the problem of rising student debt”, or “invest this pile of money to get high returns without too much risk”, that we will automate via AI.
As our AI systems become more and more competent, we will use them to automate tasks of larger and larger scope and impact. Eventually, we will have “superintelligent” systems that take on tasks of huge scope that individual humans could not do (“run this company in a fully automated way”).
For systems like this, we do not have a good story suggesting that these systems will “follow common sense”, or stay bounded in scope. There is some chance that they behave in a goal-directed way; if so they may choose plans that optimize against humans, potentially resulting in human extinction.
On my view, AI alignment is primarily about defusing this second argument. Importantly, in order to defuse it, we do not need to define “the good”—we need to provide a general method for creating AI systems that pursue some specific task, interpreted the way we meant it to be interpreted. Once we don’t have to define “the good”, many of the philosophical challenges relating to values and ethics go away.
The choice of how these AI systems are used happens the same way that such things have happened so far: through a combination of market forces, government regulation, public pressure, etc. Humanity as a whole may want to have a more deliberate approach; this is the goal of AI governance work, for example. Note that technical work can be done towards this goal as well—the ARCHES agenda has lots of examples. But I wouldn’t call this part of AI alignment.
(I do think ontological shifts continue to be relevant to my description of the problem, but I’ve never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)
----
I should note that the view I’m espousing here may not be the majority view—I think it’s more likely that your view is more common amongst AI alignment researchers.
----
Some specific comments:
it’s based on the work that gets highlighted in venues like the Alignment Newsletter, or that gets discussed on the AI Alignment forum.
If you literally mean things in the highlights section of the newsletter, that’s “things that Rohin thinks AI alignment researchers should read”, which is heavily influenced by my evaluation of what’s important / relevant, as well as what I do / don’t understand. This still seems like a fair way to evaluate what the alignment community thinks about, but I think it is going to overestimate how parochial the community is. For example, if you go by “what does Stuart Russell think is important”, I expect you get a very different view on the field, much of which won’t be in the Alignment Newsletter.
I agree that alignment researchers tend to have this stance, but I don’t think their work does? Reward functions are typically allowed to depend on actions, and the alignment community is particularly likely to use reward functions on entire trajectories, which can express arbitrary views (though I agree that many views are not “naturally” expressed in this framework).
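To make the expressivity point concrete, here is a minimal sketch (all names and the trajectory format are made up for illustration): a reward defined over an entire trajectory can depend on which actions were taken, not just on the final outcome, so it can encode action-based judgments that a per-state reward cannot.

```python
# Sketch only: a trajectory is assumed to be a list of {"action": str, "state": dict} steps.

def outcome_reward(final_state):
    # Purely outcome-based reward: only the end state matters.
    return final_state.get("goal_achieved", 0.0)

def trajectory_reward(trajectory):
    # History-dependent reward: penalize a forbidden action regardless of how
    # well things turn out, which a per-state reward cannot express.
    used_deception = any(step["action"] == "deceive" for step in trajectory)
    outcome = trajectory[-1]["state"].get("goal_achieved", 0.0)
    return outcome - (10.0 if used_deception else 0.0)
```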
It’s worth noting that the first three of these tendencies are very much influenced by recent successes of deep reinforcement learning in AI. In fact, prior to these successes, a lot of work in AI was more on the other end of the spectrum: first order logic, classical planning, cognitive systems, etc. One worry then, is that the attention of AI alignment researchers might be unduly influenced by the success or popularity of contemporary AI paradigms.
(I’d cite deep learning generally, not just deep RL.)
If you start with an uninformative prior and no other evidence, it seems like you should be focusing a lot of attention on the paradigm that is most successful / popular. So why is this influence “undue”?
Thanks for these thoughts! I’ll respond to your disagreement with the framework here, and to the specific comments in a separate reply.
First, with respect to my view about the sources of AI risk, the characterization you’ve put forth isn’t quite accurate (though it’s a fair guess, since I wasn’t very explicit about it). In particular:
These days I’m actually more worried by structural risks and multi-multi alignment risks, which may be better addressed by AI governance than technical research per se. If we do reach super-intelligence, I think it’s more likely to be along the lines of CAIS than the kind of agential super-intelligence pictured by Bostrom. That said, I still think that technical AI alignment is important to get right, even in a CAIS-style future, hence this talk—I see it as necessary, but not sufficient.
I don’t think that powerful AI systems will necessarily be optimizing for anything (at least not in the agential sense suggested by Superintelligence). In fact, I think we should actively avoid building globally optimizing agents if possible—I think optimization is the wrong framework for thinking about “rationality” or “human values”, especially in multi-agent contexts. I think it’s still non-trivially likely that we’ll end up building AGI that’s optimizing in some way, just because that’s the understanding of “rationality” or “solving a task” that’s so predominant within AI research. But in my view, that’s precisely the problem, and part of my argument for philosophical pluralism is that it offers theories of rationality, value, and normativity that aren’t about “maximizing the good”.
Regarding “the good”, the primary worry I was trying to raise in this talk has less to do with “ethical error”, which can arise due to e.g. Goodhart’s curse, and more to do with meta-ethical and meta-normative error, i.e., that the formal concepts and frameworks that the AI alignment community has typically used to understand fuzzy terms like “value”, “rationality” and “normativity” might be off-the-mark.
For me, this sort of error is importantly different from the kind of error considered by inner and outer alignment. It’s often implicit in the mathematical foundations of decision theory and ML theory itself, and tends to go unnoticed. For example, once we define rationality as “maximize expected future reward”, or assume that human behavior reflects reward-rational implicit choice, we’re already making substantive commitments about the nature of “value” and “rationality” that preclude other plausible characterizations of these concepts, some of which I’ve highlighted in the talk. Of course, there has been plenty of discussion about whether these formalisms are in fact the right ones—and I think MIRI-style research has been especially valuable for clarifying our concepts of “agency” and “epistemic rationality”—but I’ve yet to see some of these alternative conceptions of “value” and “practical rationality” discussed heavily in AI alignment spaces.
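To spell out the formalisms I mean (these are just the standard textbook statements, nothing specific to the talk): “maximize expected future reward” is roughly the objective

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$

and “reward-rational implicit choice” models observed human behavior as approximately Boltzmann-rational in some latent reward, roughly

$$P(\text{choice } c \mid r) \;\propto\; \exp\big(\beta\, \mathbb{E}[r \mid c]\big).$$

Adopting either of these as the definition of rationality, or as the model of how behavior evidences value, is already a substantive commitment of the kind I have in mind.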
Second, with respect to your characterization of AI development and AI risk, I believe that points 1 and 2 above suggest that our views don’t actually diverge that much. My worry is that the difficulty of building machines that “follow common sense” is on the same order of magnitude as “defining the good”, and just as beset by the meta-ethical and meta-normative worries I’ve raised above. After all, “common sense” is going to include “common social sense” and “common moral sense”, and this kind of knowledge is irreducibly normative. (In fact, I think there’s good reason to think that all knowledge and inquiry is irreducibly normative, but that’s a stronger and more contentious claim.)
Furthermore, given that AI is already deployed in social domains which tend to have open scope (personal assistants, collaborative and caretaking robots, legal AI, etc.), I think it’s a non-trivial possibility that we’ll end up having powerful misaligned AI applied to those contexts, and that such systems will either violate their intended scope, or require wide scope to function well (e.g., personal assistants). No doubt, “follow common sense” is a lower bar than “solve moral philosophy”, but on the view that philosophy is just common sense applied to itself, solving “common sense” is already most of the problem. For that reason, I think it deserves a plurality of disciplinary* and philosophical perspectives as well.
(*On this note, I think cognitive science has a lot to offer with regard to understanding “common sense”. Perhaps I am overly partial given that I am in a computational cognitive science lab, but it does feel like there’s insufficient awareness or discussion of cognitive scientific research within AI alignment spaces, despite its [IMO clearcut] relevance.)
I agree with you on 1 and 2 (and am perhaps more optimistic about not building globally optimizing agents; I actually see that as the “default” outcome).
My worry is that the difficulty of building machines that “follow common sense” is on the same order of magnitude as “defining the good”, and just as beset by the meta-ethical and meta-normative worries I’ve raised above.
I think this is where I disagree. I’d offer two main reasons not to believe this:
Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all. (Though you could argue that the relevant meta-ethical and meta-normative concepts are inherent in / embedded in / compiled into the human brain’s “priors” and learning algorithm.)
Intuitively, it seems like sufficiently good imitations of humans would have to have (perhaps implicit) knowledge of “common sense”. We can see this to some extent, where GPT-3 demonstrates implicit knowledge of at least some aspects of common sense (though I do not claim that it acts in accordance with common sense).
(As a sanity check, we can see that neither of these arguments would apply to the “learning human values” case.)
I’m going to assume that Quality Y is “normative” if determining whether an object X has quality Y depends on who is evaluating Y. Put another way, an independent race of aliens that had never encountered humans would probably not converge to the same judgments as we do about quality Y.
This feels similar to the is-ought distinction: you cannot determine “ought” facts from “is” facts, because “ought” facts are normative, whereas “is” facts are not (though perhaps you disagree with the latter).
I think “common sense is normative” is sufficient to argue that a race of aliens could not build an AI system that had our common sense, without either the aliens or the AI system figuring out the right meta-normative concepts for humanity (which they presumably could not do without encountering humans first).
I don’t see why it implies that we cannot build an AI system that has our common sense. Even if our common sense is normative, its effects are widespread; it should be possible in theory to back out the concept from its effects, and I don’t see a reason it would be impossible in practice (and in fact human children feel like a great example that it is possible in practice).
I suspect that on a symbolic account of knowledge, it becomes more important to have the right meta-normative principles (though I still wonder what one would say to the example of children). I also think cog sci would be an obvious line of attack on a symbolic account of knowledge; it feels less clear how relevant it is on a connectionist account. (Though I haven’t read the research in this space; it’s possible I’m just missing something basic.)
Children learn to follow common sense, despite not having (explicit) meta-ethical and meta-normative beliefs at all.
Children also learn right from wrong—I’d be interested in where you draw the line between “An AI that learns common sense” and “An AI that learns right from wrong.” (You say this argument doesn’t apply in the case of human values, but it seems like you mean only explicit human values, not implicit ones.)
My suspicion, which is interesting to me so I’ll explain it even if you’re going to tell me that I’m off base, is that you’re thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn’t carry over to “knowing right from wrong.” An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.
But when I think about it like that, it actually makes me less trusting of learned common sense. After all, one of the most universally acknowledged things about common sense is that it’s uncommon among humans! Merely doing common sense as well as humans seems like a recipe for making a horrible mistake because it seemed like the right thing at the time—this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense]).
P.S. I’m not sure I quite agree with this particular setting of normativity. The reason is the possibility of “subjective objectivity”, where you can make what you mean by “Quality Y” arbitrarily precise and formal if given long enough to split hairs. Thus equipped, you can turn “Does this have quality Y?” into an objective question by checking against the (sufficiently) formal, precise definition.
The point is that the aliens are going to be able to evaluate this formal definition just as well as you. They just don’t care about it. Even if you both call something “Quality Y,” that doesn’t avail you much if you’re using that word to mean very different things. (Obligatory old Eliezer post)
Anyhow, I’d bet that xuan is not saying that it is impossible to build an AI with common sense—they’re saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.
Children also learn right from wrong—I’d be interested in where you draw the line between “An AI that learns common sense” and “An AI that learns right from wrong.”
I’m happy to assume that AI will learn right from wrong to about the level that children do. This is not a sufficiently good definition of “the good” that we can then optimize it.
My suspicion, which is interesting to me so I’ll explain it even if you’re going to tell me that I’m off base, is that you’re thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn’t carry over to “knowing right from wrong.” An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.
That sounds basically right, with the caveat that you want to be a bit more specific and precise with what the AI system should do than just saying “common sense”; I’m using the phrase as a placeholder for something more precise that we need to figure out.
Also, I’d change the last sentence to “an AI that has learned right from wrong to the same extent that humans learn it, and then optimizes for right things as hard as possible, will probably make dangerous moral mistakes”. The point is that when you’re trying to define “the good” and then optimize it, you need to be very very correct in your definition, whereas when you’re trying not to optimize too hard in the first place (which is part of what I mean by “common sense”) then that’s no longer the case.
After all, one of the most universally acknowledged things about common sense is that it’s uncommon among humans!
I think at this point I don’t think we’re talking about the same “common sense”.
Merely doing common sense as well as humans seems like a recipe for making a horrible mistake because it seemed like the right thing at the time—this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense]).
But why?
they’re saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.
Again it depends on how accurate the “right/wrong classifier” needs to be, and how accurate the “common sense” needs to be. My main claim is that the path to safety that goes via “common sense” is much more tolerant of inaccuracies than the path that goes through optimizing the output of the right/wrong classifier.
My first idea is, you take your common sense AI, and rather than saying “build me a spaceship, but, like, use common sense,” you can tell it “do the right thing, but, like, use common sense.” (Obviously with “saying” and “tell” in invisible finger quotes.) Bam, Type-1 FAI.
Of course, whether this will go wrong or not depends on the specifics. I’m reminded of Adam Shimi et al.’s recent post that mentioned “Ideal Accomplishment” (how close to an explicit goal a system eventually gets) and “Efficiency” (how fast it gets there). If you have a general purpose “common sensical optimizer” that optimizes any goal but, like, does it in a common sense way, then before you turn it on you’d better know whether it’s affecting ideal accomplishment, or just efficiency.
That is to say, if I tell it to make me the best spaceship it can or something similarly stupid, will the AI “know that the goal is stupid” and only make a normal spaceship before stopping? Or will it eventually turn the galaxy into spaceships, just taking common-sense actions along the way? The truly idiot-proof common sensical optimizer changes its final destination so that it does what we “obviously” meant, not what we actually said. The flaws in this process seem to determine if it’s trustworthy enough to tell to “do the right thing,” or trustworthy enough to tell to do anything at all.
Replying to the specific comments:

This still seems like a fair way to evaluate what the alignment community thinks about, but I think it is going to overestimate how parochial the community is. For example, if you go by “what does Stuart Russell think is important”, I expect you get a very different view on the field, much of which won’t be in the Alignment Newsletter.
I agree. I intended to gesture a little bit at this when I mentioned that “Until more recently, it’s also been excluded and not taken very seriously within traditional academia”, because I think one source of greater diversity has been the uptake of AI alignment in traditional academia, leading to slightly more inter-disciplinary work, as well as a greater diversity of AI approaches. I happen to think that CHAI’s research publications page reflects more of the diversity of approaches I would like to see, and wish that more new researchers were aware of them (as opposed to the advice currently given by, e.g., 80K, which is to skill up in deep learning and deep RL).
Reward functions are typically allowed to depend on actions, and the alignment community is particularly likely to use reward functions on entire trajectories, which can express arbitrary views (though I agree that many views are not “naturally” expressed in this framework).
Yup, I think purely at the level of expressivity, reward functions on a sufficiently extended state space can express basically anything you want. That still doesn’t resolve several worries I have though:
Talking about all human motivation using “rewards” tends to promote certain (behaviorist / Humean) patterns of thought over others. In particular I think it tends to obscure the logical and hierarchical structure of many aspects of human motivation—e.g., that many of our goals are actually instrumental sub-goals in higher-level plans, and that we can cite reasons for believing, wanting, or planning to do a certain thing. I would prefer if people used terms like “reasons for action” and “motivational states”, rather than simply “reward functions”.
Even if reward functions can express everything you want them to, that doesn’t mean they’ll be able to learn everything you want them to, or generalize in the appropriate ways. For example, I think deep RL agents are unlikely to learn the concept of “promises” in a way that generalizes robustly, unless you give them some kind of inductive bias that leads them to favor structures like LTL formulas (this worry is related to Stuart Armstrong’s no-free-lunch theorem; see the sketch after this list). At some point I intend to write a longer post about this worry.
Of course, you could just define reward functions over logical formulas and the like, and do something like reward modeling via program induction, but at that point you’re no longer using “reward” in the way it’s typically understood. (This is similar to the move, made by some Humeans, that reason can only be motivating because we desire to follow reason. That’s fair enough, but misses the point of calling certain kinds of motivations “reasons” at all.)
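As a rough illustration of the sort of structured objective I have in mind (a purely hypothetical encoding, not a claim about any existing system): a “promise” treated as a temporal property of the whole trajectory, in the spirit of an LTL “eventually” obligation, rather than as a shaped scalar reward.

```python
# Sketch only: steps are assumed to be dicts like {"action": str, "task_reward": float}.

def eventually(pred, trajectory, start=0):
    # LTL-style "F" (eventually): pred holds at some step at or after `start`.
    return any(pred(step) for step in trajectory[start:])

def promises_kept(trajectory):
    # Every "promise X" action must eventually be followed by a matching "deliver X" action.
    for i, step in enumerate(trajectory):
        if step["action"].startswith("promise "):
            item = step["action"][len("promise "):]
            if not eventually(lambda s, item=item: s["action"] == "deliver " + item,
                              trajectory, start=i + 1):
                return False
    return True

def reward(trajectory):
    # The reward only pays out if the temporal obligation is satisfied.
    total = sum(step.get("task_reward", 0.0) for step in trajectory)
    return total if promises_kept(trajectory) else 0.0
```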
(I’d cite deep learning generally, not just deep RL.)
You’re right, that’s what I meant, and have updated the post accordingly.
If you start with an uninformative prior and no other evidence, it seems like you should be focusing a lot of attention on the paradigm that is most successful / popular. So why is this influence “undue”?
I agree that if you start with a very uninformative prior, focusing on the most recently successful paradigm makes sense. But once you take into account slightly more information, I think there’s reason to believe the AI alignment community is currently overly biased towards deep learning:
The trend-following behavior in most scientific & engineering fields, including AI, should make us skeptical that currently popular approaches are popular for the right reasons. In the 80s everyone was really excited about expert systems and the Fifth Generation project. About 10 years ago, Bayesian non-parametrics were really popular. Now deep learning is popular. Knowing this history suggests that we should be a little more careful about joining the bandwagon. Unfortunately, a lot of us joining the field now don’t really know this history, nor are we necessarily exposed to the richness and breadth of older approaches before diving headfirst into deep learning (I only recognized this after starting my PhD and learning more about symbolic AI planning and programming languages research).
We have extra reason to be cautious about deep learning being popular for the wrong reasons, given that many AI researchers say that we should be focusing less on machine learning while at the same time publishing heavily in machine learning. For example, at the AAAI 2019 informal debate, the majority of audience members voted against the proposition that “The AI community today should continue to focus mostly on ML methods”. At some point during the debate, it was noted that despite the opposition to ML, most papers at AAAI that year were about ML, and it was suggested, to some laughter, that people were publishing in ML simply because that’s what would get them published.
The diversity of expert opinion about whether deep learning will get us to AGI doesn’t feel adequately reflected in the current AI alignment community. Not everyone thinks the Bitter Lesson is quite the lesson we have to learn. A lot of prominent researchers like Stuart Russell, Gary Marcus, and Josh Tenenbaum all think that we need to re-invigorate symbolic and Bayesian approaches (perhaps through hybrid neuro-symbolic methods), and if you watch the 2019 Turing Award keynotes by both Hinton and Bengio, both of them emphasize the importance of having structured generative models of the world (they just happen to think it can be achieved by building the right inductive biases into neural networks). In contrast, outside of MIRI, it feels like a lot of the alignment community anchors towards the work that’s coming out of OpenAI and DeepMind.
My own view is that the success of deep learning should be taken in perspective. It’s good for certain things, and certain high-data training regimes, and will remain good for those use cases. But in a lot of other use cases, where we might care a lot about sample efficiency and rapid + robust generalizability, most of the recent progress has, in my view, been made by cleverly integrating symbolic approaches with neural networks (even AlphaGo can be seen as a version of this, if one views MCTS as symbolic). I expect future AI advances to occur in a similar vein, and for me that lowers the relevance of ensuring that end-to-end DL approaches are safe and robust.
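To illustrate the kind of integration I mean (a toy sketch with made-up interfaces, not a description of AlphaGo itself): a classical search loop whose node ordering comes from a learned evaluation function, loosely analogous to how MCTS in AlphaGo is guided by learned policy and value networks.

```python
import heapq

def learned_value(state):
    # Stand-in for a neural value estimate; here just a toy heuristic.
    return -abs(state - 42)

def expand(state):
    # Stand-in for symbolic successor generation (legal moves, planner operators, etc.).
    return [state + 1, state - 1, state * 2]

def guided_search(start, is_goal, budget=1000):
    # Best-first search: the symbolic search structure is intact, but the
    # learned value function decides which nodes get expanded first.
    frontier = [(-learned_value(start), start)]
    visited = {start}
    for _ in range(budget):
        if not frontier:
            return None
        _, state = heapq.heappop(frontier)
        if is_goal(state):
            return state
        for nxt in expand(state):
            if nxt not in visited:
                visited.add(nxt)
                heapq.heappush(frontier, (-learned_value(nxt), nxt))
    return None

# e.g. guided_search(0, lambda s: s == 42) reaches the goal quickly because the
# learned heuristic orders the symbolic expansion.
```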
Re: worries about “reward”, I don’t feel like I have a great understanding of what your worry is, but I’d try to summarize it as “while the abstraction of reward is technically sufficiently expressive, 1) it may not have the right inductive biases, and so the framework might fail in practice, and 2) it is not a good framework for thought, because it doesn’t sufficiently emphasize many important concepts like logic and hierarchical planning”.
I think I broadly agree with those points if our plan is to explicitly learn human values, but it seems less relevant when we aren’t trying to do that and are instead trying to
provide a general method for creating AI systems that pursue some specific task, interpreted the way we meant it to be interpreted.
In this framework, “knowledge about what humans want” doesn’t come from a reward function, it comes from something like GPT-3 pretraining. The AI system can “invent” whatever concepts are best for representing its knowledge, which includes what humans want.
Here, reward functions should instead be thought of as akin to loss functions—they are ways of incentivizing particular kinds of outputs. I think it’s reasonable to think on priors that this wouldn’t be sufficient to get logical / hierarchical behavior, but I think GPT and AlphaStar and all the other recent successes should make you rethink that judgment.
----
The trend-following behavior in most scientific & engineering fields, including AI, should make us skeptical that currently popular approaches are popular for the right reasons.
I agree that trend-following behavior exists. I agree that this means that work on deep learning is less promising than you might otherwise think. That doesn’t mean it’s the wrong decision; if there are a hundred other plausible directions, it can still be the case that it’s better to bet on deep learning rather than try your hand at guessing which paradigm will become dominant next. To quote Rodney Brooks:
Whatever [the “next big thing”] turns out to be, it will be something that someone is already working on, and there are already published papers about it. There will be many claims on this title earlier than 2023, but none of them will pan out.
He also predicts that the “next big thing” will happen by 2027 (though I get the sense that he might count new kinds of deep learning architectures as a “big thing” so he may not be predicting something as paradigm-shifting as you’re thinking).
Whether to diversify depends on the size of the field; if you have 1 million alignment researchers you definitely want to diversify, whereas at 5 researchers you almost certainly don’t; I’m claiming that we’re small enough now and uninformed enough about alternatives to deep learning that diversification is not a great approach.
We have extra reason to be cautious about deep learning being popular for the wrong reasons, given that many AI researchers say that we should be focusing less on machine learning while at the same time publishing heavily in machine learning.
Just because AI research should diversify doesn’t mean alignment research should diversify—given their relative sizes, it seems correct for alignment researchers to focus on the dominant paradigm while letting AI researchers explore the space of possible ways to build AI. Alignment researchers should then be ready to switch paradigms if a new one is found.
A lot of of prominent researchers like Stuart Russell, Gary Marcus, and Josh Tenenbaum all think that we need to re-invigorate symbolic and Bayesian approaches (perhaps through hybrid neuro-symbolic methods)
This feels like the most compelling argument, since it identifies particular other approaches (though still very large ones). Some objections from the outside view:
I think all three of the researchers you mentioned have long timelines; work is generally more useful on shorter timelines, so this should bias you towards what is currently popular. Some of these researchers don’t think we can get to AGI at all; as long as you aren’t confident that they are correct, you should ignore that position (if we’re in that world, then there isn’t any AI alignment x-risk, so it isn’t decision-relevant).
I find the arguments given by these researchers to be relatively weak and easily countered, and am more inclined to use inside-view arguments as a result. (Though I should note that I think that it is often correct to trust in an expert even when their arguments seem weak, so this is a relatively minor point.)
(Re: Hinton and Bengio, I feel like that’s in support of the work that’s currently being done; the work that comes out of those labs doesn’t seem that different from what comes out of OpenAI and DeepMind.)
Going to the inside view on neurosymbolic AI:
(even AlphaGo can be seen as a version of this, if one views MCTS as symbolic)
I feel like if you endorse this then you should also think of iterated amplification as neurosymbolic (though maybe you think if humans are involved that’s “neurohuman” rather than neurosymbolic and the distinction is relevant for some reason).

Overall, I do expect that neurosymbolic approaches will be helpful and used in many practical AI applications; they allow you to encode relevant domain knowledge without having to learn it all from scratch. I don’t currently see that it introduces new alignment problems, or changes how we should think about the existing problems that we work on, and that’s the main reason I don’t focus on it. But I certainly agree with that as a background model of what future AI systems will look like, and if someone identified a problem that happens with neurosymbolic AI that isn’t addressed by current work in AI alignment, I’d be pretty excited to see research solving that problem, and might do it myself.
----
Things I do agree with:
It would be significantly better if the average / median commenter on the Alignment Forum knew more about AI techniques. (I think this is also true of deep learning.)
There will probably be something in the future that radically changes our beliefs about AGI.
(I do think ontological shifts continue to be relevant to my description of the problem, but I’ve never been convinced that we should be particularly worried about ontological shifts, except inasmuch as they are one type of possible inner alignment / robustness failure.)
I feel that the whole AI alignment problem can be seen as problems with ontological shifts: https://www.lesswrong.com/posts/k54rgSg7GcjtXnMHX/model-splintering-moving-from-one-imperfect-model-to-another-1

I think I agree at least that many problems can be seen this way, but I suspect that other framings are more useful for solutions. (I don’t think I can explain why here, though I am working on a longer explanation of what framings I like and why.)

What I was claiming in the sentence you quoted was that I don’t see ontological shifts as a huge additional category of problem that isn’t covered by other problems, which is compatible with saying that ontological shifts can also represent many other problems.

Cheers, that would be very useful.