As just one example: what if superintelligence takes the form of a community of connected AGIs running on (and intrinsically regulated by) a crypto-legal system, with decision policies implemented hierarchically over sub-agents (there’s even a surprisingly strong argument that the brain is a similar society of simpler minds resolving decisions through the basal ganglia)? Then alignment is also a mechanism design problem, a socio-economic-political problem.
Although I guess that’s arguably still ‘technical’, just technical within an expanded domain.
13. People haven’t tried very hard to find non-MIRI-ish approaches that might work.
So you haven’t heard of IRL, CIRL, value learning, that whole DL safety track, etc? Or are you outright dismissing them? I’d argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
(Which isn’t to say that MIRI-AF wasn’t a good investment on net for the world, even if it was low probability-of-success)
So you haven’t heard of IRL, CIRL, value learning, that whole DL safety track, etc? Or are you outright dismissing them? I’d argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
This comment and the entire conversation that spawned from it is weirdly ungrounded in the text — I never even mentioned DL. The thing I was expressing was ‘relative to the capacity of the human race, and relative to the importance and (likely) difficulty of the alignment problem, very few research-hours have gone into the alignment problem at all, ever; so even if you’re pessimistic about the entire space of MIRI-ish research directions, you shouldn’t have a confident view that there are no out-of-left-field research directions that could arise in the future to take big bites out of the alignment problem’.
The rhetorical approach of the comment is also weird to me. ‘So you’ve never heard of CIRL?’ surely isn’t a hypothesis you’d give more weight to than ‘You think CIRL wasn’t a large advance’, ‘You think CIRL is MIRI-ish’, ‘You disagree with me about the size and importance of the alignment problem such that you think it should be a major civilizational effort’, ‘You think CIRL is cool but think we aren’t yet hitting diminishing returns on CIRL-sized insights and are therefore liable to come up with a lot more of them in the future’, etc. So I assume the question is rhetorical; but then it’s not clear to me what you believe about CIRL or what point you want to make with it.
I’d argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
I think this is straightforwardly true in two different ways:
Prior to the deep learning revolution, Eliezer didn’t predict that ANNs would be a big deal — he expected other, neither-GOFAI-nor-connectionist approaches to AI to be the ones that hit milestones like ‘solve Go’.
MIRI thinks the current DL paradigm isn’t alignable, so we made a bet on trying to come up with more alignable AI approaches (which we thought probably wouldn’t succeed, but considered high-enough-EV to be worth the attempt).
I don’t think this has anything to do with the OP, but I’m happy to talk about it in its own right. The most relevant thing would be if we lost a bet like ‘we predict deep learning will be too opaque to align’; but we’re still just as pessimistic about humanity’s ability to align deep nets as ever, so if you think we’ve hugely underestimated the tractability of aligning deep nets, I’d need to hear more about why. What’s the path to achieving astronomically good outcomes, on the assumption that the first AGI systems are produced by 2021-style ML methods?
Thanks, strong upvote, this is especially clarifying.
Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high-certainty paradigm); we may or may not agree on to what extent it is approximately/weakly alignable.
The weakly alignable baseline should be “marginally better than humans”. Achieving that baseline as an MVP should be an emergency-level, high-priority civilizational project, even if the risk of doom from DL AGI is only 1% (and to be clear, I’m quite uncertain, but it’s probably considerably higher). Ideally we should always have an MVP alignment solution in place.
My thoughts on your last question are probably best expressed in a short post rather than a comment thread, but in summary:
DL methods are based on simple universal learning architectures (eg transformers, but AGI will probably be built on something even more powerful). The important properties of the resulting agents are thus much more a function of the data / training environment than of the architecture. You can rather easily limit an AGI’s power by constraining its environment. For example, we have nothing to fear from AGIs trained solely in Atari; we have much more to fear from agents trained by eating the internet. Boxing is stupid, but sim sandboxing is key.
As DL methods are already a success story in partial brain reverse engineering (explicitly so in DeepMind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans—i.e. the approximate alignment solution that evolution found. We can then improve and iterate on that in simulations. I’m somewhat optimistic that it’s no more complex than other major brain systems we’ve already mostly reverse engineered.
The danger of course is that testing and iterating could use enormous resources, past the point where you already have a dangerous architecture that could be extracted. Nonetheless, I think this approach is much better than nothing, and amenable to (potentially amplified) iterative refinement.
Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high certainty paradigm), we may or may not agree to what extent it is approximately/weakly alignable.
I don’t know what “strongly alignable”, “robust, high certainty paradigm”, or “approximately/weakly alignable” mean here. As I said in another comment:
There are two problems here:
Problem #1: Align limited task AGI to do some minimal act that ensures no one else can destroy the world with AGI.
Problem #2: Solve the full problem of using AGI to help us achieve an awesome future.
Problem #1 is the one I was talking about in the OP, and I think of it as the problem we need to solve on a deadline. Problem #2 is also indispensable (and a lot more philosophically fraught), but it’s something humanity can solve at its leisure once we’ve solved #1 and therefore aren’t at immediate risk of destroying ourselves.
If you have enough time to work on the problem, I think basically any practical goal can be achieved in CS, including robustly aligning deep nets. The question in my mind is not ‘what’s possible in principle, given arbitrarily large amounts of time?’, but rather ‘what can we do in practice to actually end the acute risk period / ensure we don’t blow ourselves up in the immediate future?’.
(Where I’m imagining that you may have some number of years pre-AGI to steer toward relatively alignable approaches to AGI; and that once you get AGI, you have at most a few years to achieve some pivotal act that prevents AGI tech somewhere in the world from paperclipping the world.)
The weakly alignable baseline should be “marginally better than humans”.
I don’t understand this part. If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world’s existential risk. (Similarly, I think fast-running high-fidelity human emulations are one of the more plausible techs humanity could use to save the world, since you could then do a lot of scarily impressive intellectual work quickly (including work on the alignment problem) without putting massive work into cognitive transparency, oversight, etc.)
I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred. So I’m thinking in terms of ‘what’s the least difficult-to-align act humanity could attempt with an AGI?’.
Maybe you mean something different by “marginally better than humans”?
As DL methods are already a success story in partial brain reverse engineering (explicitly in deepmind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc in humans—ie the approximate alignment solution that evolution found.
I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’), not a Problem #1 research direction (‘we have a few months to a few years to do this one very concrete AI-developing-a-new-physical-technology task really well’).
For what it’s worth I’m cautiously optimistic that “reverse-engineering the circuits underlying empathy/love/altruism/etc.” is a realistic thing to do in years not decades, and can mostly be done in our current state of knowledge (i.e. before we have AGI-capable learning algorithms to play with—basically I think of AGI capabilities as largely involving learning algorithm development and empathy/whatnot as largely involving supervisory signals such as reward functions). I can share more details if you’re interested.
Maybe you mean something different by “marginally better than humans”?
No, I meant “merely as aligned as a human”. Which is why I used “approximately/weakly” aligned—as the system that mostly aligns humans to humans is imperfect, and not what I would have assumed you meant as a full Problem #2 type solution.
I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred.
I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’),
Alright so now I’m guessing the crux is that you believe the DL based reverse engineered human empathy/altruism type solution I was alluding to—let’s just call that DLA—may take subjective centuries, which thus suggests that you believe:
That DLA is significantly more difficult than DL AGI in general
That uploading is likewise significantly more difficult
or perhaps
DLA isn’t necessarily super hard, but irrelevant because non-DL AGI (for which DLA isn’t effective) comes first
So I can see how that is a reasonable interpretation of what you were expressing. However, given the opening framing where you said you basically agreed with Eliezer’s pessimistic viewpoint that seems to dismiss most alignment research, I hope you can understand how I interpreted you saying “People haven’t tried very hard to find non-MIRI-ish approaches that might work” as dismissing ML-safety research like IRL, CIRL, etc.
I… think that makes more sense? Though Eliezer was saying the field’s progress overall was insufficient, not saying ‘decision theory good, ML bad’. He singled out eg Paul Christiano and Chris Olah as two of the field’s best researchers.
Interesting—it’s not clear to me how that dialogue addresses the common misconception.
My brief zero-effort counter-argument to that dialogue is: it’s hard to make rockets or airplanes safe without first mastering aerospace engineering.
So I think it’s super obvious that EY/MIRI/LW took the formalist side over connectivist, which I discuss more explicitly in the intro to my most recent 2021 post, which links to my 2015 post which discussed the closely connected ULM vs EM brain theories, which then links to my 2010 post discussing a half-baked connectivist alignment idea with some interesting early debate vs LW formalists (and also my successful prediction of first computer Go champion 5 years in advance).
So I’ve been here a while, and I even had a number of conversations with MIRI’s two-person ML-but-not-DL alignment group (Jessica & Jack) when that was briefly a thing, and it would be extremely ambitious revisionist history to claim that EY/MIRI didn’t implicitly if not explicitly bet against connectivism.
So that’s why I asked Rob about point 13 above—as it seems unjustifiably dismissive of the now dominant connectivist-friendly alignment research (and said dismissal substantiates my point).
But I’m not here to get in some protracted argument about this. So why am I here? Because the event-horizons of phyg attractors are obvious historical schelling points to meet other interesting people. Speaking of which, we should chat—I really liked your Birds, Brains, Planes post in particular; I actually wrote up something quite similar a while ago.
Thanks! What’s a phyg attractor? Google turns up nothing.
To say a bit more about my skepticism—there are various reasons why one might want to focus on agent foundations and similar stuff even if you also think that deep learning is about to boom and be super effective and profitable. For example, you might think the deep-learning-based stuff is super hard to align relative to other paradigms. Or you might think that we won’t be able to align it until we are less confused about fundamental issues, and the way to deconfuse ourselves is to think in formal abstractions rather than messing around with big neural nets. Or you might think that both ways are viable but the formal abstraction route is relatively neglected. So the fact that MIRI bet on agent foundations stuff doesn’t seem like strong evidence that they were surprised by the deep learning boom, or at least, more surprised than their typical contemporaries.
Like I said in the parent comment—investing in AF can be a good bet, even if it’s low probability of success. And I mostly agree with your rationalizations there, but they are post-hoc. I challenge you to find early evidence (ideally 2010 or earlier—for reasons explained in a moment) documenting that MIRI leaders “also think that deep learning is about to boom and be super effective and profitable”.
The connectivist-futurists (Moravec/Kurzweil) were already predicting a timeline for AGI in the 2020s through brain reverse engineering. EY/MIRI implicitly/explicitly critiqued that and literally invested time/money/resources in hiring/training up people (a whole community, arguably!) in knowledge/beliefs very different from—and mostly useless for understanding—the connectivist/DL path to AGI.
So if you truly believed in 2010, after hearing a recent neuroscience PhD’s first presentation on how they were going to reverse engineer the brain (DeepMind), and you actually gave that even a 50% chance of success—do you truly believe it would be wise to invest the way MIRI did? And to be hostile to connectivist/DL approaches, as they still are? Do you not think they at least burned some bridges? Have you seen EY’s recent thread, where he attempts a blatant revisionist-history critique of Moravec? (Moravec actually claimed AGI around 2028, not 2010, which seems surprisingly on-track and prescient to me now in 2021.)
Again, quoting Rob from above:
13. People haven’t tried very hard to find non-MIRI-ish approaches that might work.
Which I read as dismissing the DL-friendly alignment research tracks: IRL/CIRL/value learning, etc. And EY explicitly dismisses most alignment research in some other recent thread.
I don’t know what to believe yet; I appreciate the evidence you are giving here (in particular your experience as someone who has been around in the community longer than me). My skepticism was about the inference from MIRI did abstract AF research --> MIRI thought deep learning would be much less effective than it in fact was.
I do remember reading some old posts from EY about connectionism that suggest he at least failed to predict the deep learning boom in advance. That’s different from confidently predicting it wouldn’t happen, though.
I too think that Moravec et al deserve praise for successfully predicting the deep learning boom and having accurate AI timelines 40 years in advance.
Old LessWrong meme—phyg is rot13 for “cult”. For a while people were making “are we a cult” posts so much that it was actually messing with LessWrong’s SEO. Hence phyg.
Thanks! What’s a phyg attractor? Google turns up nothing.
Ask google what LW is—ie just start typing lesswrong or “lesswrong is a” and see the auto-complete. Using the word ‘phyg’ is an LW community norm attempt to re-train google.
I don’t think alignment is “just a technical problem” in any domain, because:
I don’t think there’s a good enough definition of “alignment” for it to be addressed in any technical way.
Saying that “being aligned” means “behaving according to human values” just throws it back to the question of how exactly you define what “human values” are. Are they what humans say they want? What humans actually do? What humans would say they wanted if they knew the results (and with what degree of certainty required)? What would make humans actually happiest (and don’t forget to define “happy”)? The extrapolated volition of humans under iterated enhancement (which, in addition to being uncomputable, is probably both dependent on initial conditions and path-dependent, with no particular justification for preferring one path over another at any given step)?
Insofar as there are at least some vague ideas of what “alignment” or “human values” might mean, treating alignment as a technical problem would require those values to have a lot more coherence than they actually seem to have.
If you ask a human to justify its behavior at time X, the human will state a value V_x. If you ask to the same human to justify some other behavior at time Y, you’ll get a value V_y. V_x and V_y will often be mutually contradictory. You don’t have a technical problem until you have a philosophically satisfying way of resolving that contradiction, which is not just a technical issue. Yet at the same time there’s feedback pressure on that philosophical decision, because some resolutions might be a lot more technically implementable than others.
Even if individual humans had coherent values, there’s every reason to think that the individuals in groups don’t share those values, and absolutely no reason at all to think that groups will converge under any reasonable kind of extrapolation. So now you have a second philosophical-problem-with-technical-feedback, namely reconciling multiple people’s contradictory values. That’s a problem with thousands of years of history, by the way, and nobody’s managed to reduce it to the technical yet, even though the idea has occurred to people.
Then you get to the technical issues, which are likely to be very hard and may not be solvable within physical limits. But it’s not remotely a technical problem yet.
It is possible to define the alignment problem without using such fuzzy concepts as “happiness” or “value”.
For example, there are two agents: R and H. The agent R can do some actions.
The agent H prefers some of the R’s actions over other actions. For example, H prefers the action make_pie to the action kill_all_humans.
Some of the preferences are unknown even to H itself (e.g. if it prefers pierogi to borscht).
Among other things, the set of the R’s actions includes:
ask_h_which_of_the_actions_is_preferable
infer_preferences_from_the_behavior_of_h
explain_consequences_of_the_action_to_h
switch_itself_off
In any given situation, the perfect agent R always chooses the most preferable action (according to H). The goal is to create an agent that is as close to the perfect R as possible.
Of course, this formalism is incomplete. But I think it demonstrates that the alignment problem can be framed as a technical problem without delving into metaphysics.
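For concreteness, here is a toy sketch of this formalism in Python (the pairwise-query scheme and the scoring model are just illustrative choices, not part of the formalism itself):

```python
from dataclasses import dataclass
from typing import Dict, List

Action = str

@dataclass
class HumanH:
    # H's true preferences, modeled as scores over actions.
    # (Some of these may be unknown even to H; this toy ignores that.)
    true_scores: Dict[Action, float]

    def stated_preference(self, a: Action, b: Action) -> Action:
        # H can answer a direct query about which action it prefers.
        return a if self.true_scores[a] >= self.true_scores[b] else b

class AgentR:
    def __init__(self, actions: List[Action]):
        self.actions = actions
        self.estimated_scores = {a: 0.0 for a in actions}

    def ask_h_which_of_the_actions_is_preferable(self, h: HumanH) -> None:
        # One of R's available actions: query H pairwise, update estimates.
        for a in self.actions:
            for b in self.actions:
                if h.stated_preference(a, b) == a:
                    self.estimated_scores[a] += 1.0

    def choose(self) -> Action:
        # The perfect R would pick H's most preferred action;
        # this R picks the best action under its current estimate.
        return max(self.actions, key=self.estimated_scores.get)

h = HumanH({"make_pie": 1.0, "kill_all_humans": -100.0, "switch_itself_off": 0.0})
r = AgentR(["make_pie", "kill_all_humans", "switch_itself_off"])
r.ask_h_which_of_the_actions_is_preferable(h)
print(r.choose())  # prints make_pie
```

The gap between the perfect R and this R is where the technical work lives: the estimate can be wrong, and querying H is itself just one of R’s actions.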
If you replace “value” with “preference” in what I wrote, I believe that it all still applies.
If you both “ask H about the preferable action” and “infer H’s preferences from the behavior of H”, then what do you do when the two yield different answers? That’s not a technical question; you could technically choose either one or even try to “average” them somehow. And it will happen.
The same applies if you have to deal with two humans, H1 and H2; they are sometimes going to disagree. How do you choose then?
There are also technical problems with both of those, and they’re the kind of technical problems I was talking about that feed back on the philosophical choices. You might start with one philosophical position, then want to change when you saw the technical results.
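A purely illustrative sketch of that feedback: any scheme for reconciling the two answers has to bake in a weighting that nothing in the problem specification determines (the function and weight here are invented for illustration):

```python
# Purely illustrative: reconciling H's stated preference with the
# preference inferred from H's behavior, when the two disagree.
# The weight w is exactly the "philosophical choice" in question;
# no technical consideration fixes its value.
def reconciled_score(stated: float, inferred: float, w: float) -> float:
    return w * stated + (1.0 - w) * inferred

stated, inferred = 1.0, -1.0  # H says "yes"; H's behavior says "no"
print(reconciled_score(stated, inferred, w=0.5))   # prints 0.0
print(reconciled_score(stated, inferred, w=0.75))  # prints 0.5
```

Every value of w yields a well-defined technical system, and the choice among them is the non-technical part.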
For the first:
It assumes that H’s “real” preferences comport with what H says. That isn’t a given, because “preference” is just as hard to define as “value”. Choosing to ask H really amounts to defining preference to mean “stated preference”.
It also assumes that H will always be able to state a preference, will be able to do so in a way that you can correctly understand, and will not be unduly ambivalent about it.
You’d probably also prefer that H (or somebody else...) not regret that preference if it gets enacted. You’d probably like to have some ability to predict that H is going to get unintended consequences, and at least give H more information before going ahead. That’s an extra feature not implied by a technical specification based on just doing whatever H says.
Related to (3), it assumes that H can usefully state preferences about courses of action more complicated than H could plan, when the consequences themselves may be more complicated than H can understand. And you yourself may have very complicated forms of uncertainty about those consequences, which makes it all the harder to explain the whole thing to H.
All of that is pretty unlikely.
The second is worse:
It assumes that H’s actions always reflect H’s preferences, which amounts to adopting a different definition of “preference”, probably even further from the common meaning.
H’s preferences aren’t required to be any simpler or more regular than a list of every possible individual situation, with a preferred course of action for each one independent of all the others. For that matter, the list is allowed to change, or be dependent on when some particular circumstances occur, or include “never do the same thing twice in the same circumstances”. Even if H’s behavior is assumed to reflect H’s preferences, there’s still nothing that says H has to have an inferable set of preferences.
To make inferences about H’s preferences, you first have to make a leap of faith and assume that they’re simple enough, compact enough, and consistent enough to be “closely enough” approximated by any set of rules you can infer. That is a non-technical leap of faith. And there’s a very good chance that it would be the wrong leap to make.
It assumes that the rules you can infer from H’s behavior are reasonably prescriptive about the choices you might have to make. Your action space may be far beyond anything H could do, and the choices you have to make may be far beyond anything H could understand.
So you end up taking a bunch of at best approximate inferences about H’s existing preferences, and trying to use them to figure out “What would H do if H were not a human, but in fact some kind of superhuman AGI totally unlike a human, but were somehow still H?”. That’s probably not a reasonable question to ask.
Oh, one more thing I should probably add: it gets even more interesting when you ask whether the AGI might act to change the human’s values (or preferences; there’s really no difference, and both are equally “fuzzy” concepts). Any action that affects the human at all is likely to have some effect on them, and some actions could be targeted to have very large effects.
I agree, you’ve listed some very valid concerns about my half-baked formalism.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
The formalism doesn’t have to be perfect. If our theoretical R makes its decisions according to the best possible approximate inferences about H’s existing preferences, then R is much better than a rogue AGI, even if it sometimes makes deadly mistakes. Any improvement over rogue AGI is a good improvement.
Compare: the Tesla AI sometimes causes deadly crashes. Yet the Tesla AI is much better than the status quo, as its net effect is thousands of saved lives.
And after we have a decent formalism, we can build a better formalism from it, and then repeat and repeat.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
Nobody’s even gotten close to metaphysics. Ethics or even epistemology, OK. Metaphysics, no. The reason I’m getting pedantic about the technical meaning of the word is that “metaphysics”, when used non-technically, is often a tag word used for “all that complicated, badly-understood stuff that might interfere with bulling ahead”.
My narrow point is that alignment isn’t a technical problem until you already have an adequate final formalism. Creating the formalism itself isn’t an entirely technical process.
If you’re talking about inferring, learning, being instructed about, or actually carrying out human preferences, values, or paths to a “good outcome”, then as far as I know nobody has an approximately adequate formalism, and nobody has a formalism with any clear path to be extended to adequacy, or even any clear hope of it. I’ve seen proposals, but none of them have stood up to 15 minutes of thought. I don’t follow it all the time; maybe I’ve missed something.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use. There’s no clear statement of what that would mean.
My broader concern is that I’m unsure an adequate list of meta-criteria can be established, and that I’m even less sure that the base formalism can exist at all. Demanding a formal system that can’t be achieved can lead to all kinds of bad outcomes, many of them related to erroneously pretending that a formalism you have usefully approximates the formalism you need.
It would be very easy to decide that, for the sake of “avoiding metaphysics”, it was important to adopt, agree upon, and stay within a certain framework—one that did not meet meta-criteria like “allows you to express constraints that assure that everybody doesn’t end up worse than dead”, let alone “allows you to express what it means to achieve the maximum benefit from AGI”, or “must provide prescriptions implementable in actual software”.
Oh, people would keep tweaking any given framework to cover more edge cases, and squeeze more and more looseness out of some definitions, and play around with more and more elegant statements of the whole thing… but that could just be a nice distraction from the fundamental lack of any “no fate worse than death” guarantee anywhere in it.
A useful formalism does have to be perfect in achieving no fates worse than death, or at least no widespread fates worse than death. It has to define “fate worse than death” in a meaningful way that doesn’t sacrifice the motivation for having the constraint in the first place. It has to achieve that over all possible fates worse than death, including ones nobody has thought of yet. It has to let you at least approximately exclude the widespread occurrence of anything that almost anybody would think was a fate worse than death, ideally while also enabling you to actually get positive benefits from your AGI.
And formal frameworks are often brittle; a formalism that doesn’t guarantee perfection does not necessarily even avert catastrophe. If you make a small mistake in defining “fate worse than death”, that may lead to a very large prevalence of the case you missed.
It’s not even true that “the best possible inferences” are necessarily better than nothing, let alone adequate in any absolute sense. In fact, a truly rogue AGI that doesn’t care about you at all seems more likely to just kill you quickly, whereas who knows what a buggy AGI that was interested in your fate might choose to do...
The very adoption of the word “aligment” seems to be a symptom of a desire to at least appear to move toward formalizing, without the change actually tending to improve the chances of a good outcome. I think people were trying to tighten up from “good outcome” when they adopted “alignment”, but actually I don’t think it helped. The connotations of the word “alignment” tend to concentrate attention on approaches that rely on humans to know what they want, or at least to have coherent desires, which isn’t necessarily a good idea at all. On the other hand, the switch doesn’t actually seem to make it any easier to design formal structures or technical approaches that will actually lead to good software behavior. It’s still vague in all the ways that matter, and it doesn’t seem to be improving at all.
To create a perfect AI for self-driving, one first must resolve all that complicated, badly-understood stuff that might interfere with bulling ahead. For example, if the car should prefer the driver’s life over the pedestrian’s life.
But while we contemplate such questions, we lose tens of thousands of lives in car crashes per year.
The people of Tesla made the rational decision of bulling ahead instead. As their AI is not perfect, it sometimes makes decisions with deadly consequences. But in total, it saves lives.
Their AI has an imperfect but good-enough formalism. AFAIK, it’s something that could be described in English as “drive to the destination without breaking the driving regulations, while minimizing the number of crashes”, or something like that.
As their AI is net saving lives, it means their formalism is indeed good enough. They have successfully reduced a complex ethical/societal problem to a purely technical problem.
Rogue AGI is very likely to kill all humans. Any better-than-rogue-AGI is an improvement, even if it doesn’t fully understand the complicated and ever changing human preferences, and even if some people will suffer as a result.
Even my half-baked sketch of a formalism, if implemented, would produce an AI that is better than a rogue AGI, in spite of the many problems you listed. Thus, working on it is better than waiting for certain death.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use
A formalism that saves more lives is better than one that saves fewer lives. That’s good enough for a start.
If you’re trying to solve a hard problem, start with something simple and then iteratively improve over it. This includes meta-criteria.
fate worse than death
I strongly believe that there is no such thing. I explained it in detail here.
I agree with your sketch of the alignment problem.
But once you move past the sketch stage the solutions depend heavily on the structure of A, which is why I questioned Rob’s dismissal of the now-dominant non-MIRI safety approaches (which are naturally more connectivist/DL friendly).
Is it though? Seriously.
As just one example: what if superintelligence takes the from of a community of connected AGI running on (and intrinsically regulated by) a crypto-legal system, with decision policies implemented hierarchically over sub-agents (there’s an even a surprisingly strong argument the brain is a similar society of simpler minds resolving decisions through the basal ganglia). Then alignment is also a mechanism design problem, a socio-economic-political problem.
Although I guess that’s arguably still ‘technical’, just technical within an expanded domain.
So you haven’t heard of IRL, CIRL, value learning, that whole DL safety track, etc? Or are you outright dismissing them? I’d argue instead that MIRI bet heavily against connectivism/DL, and lost on that bet just as heavily.
(Which isn’t to say that MIRI-AF wasn’t a good investment on net for the world, even if it was low probability-of-success)
This comment and the entire conversation that spawned from it is weirdly ungrounded in the text — I never even mentioned DL. The thing I was expressing was ‘relative to the capacity of the human race, and relative to the importance and (likely) difficulty of the alignment problem, very few research-hours have gone into the alignment problem at all, ever; so even if you’re pessimistic about the entire space of MIRI-ish research directions, you shouldn’t have a confident view that there are no out-of-left-field research directions that could arise in the future to take big bites out of the alignment problem’.
The rhetorical approach of the comment is also weird to me. ‘So you’ve never heard of CIRL?’ surely isn’t a hypothesis you’d give more weight to than ‘You think CIRL wasn’t a large advance’, ‘You think CIRL is MIRI-ish’, ‘You disagree with me about the size and importance of the alignment problem such that you think it should be a major civilizational effort’, ‘You think CIRL is cool but think we aren’t yet hitting diminishing returns on CIRL-sized insights and are therefore liable to come up with a lot more of them in the future’, etc. So I assume the question is rhetorical; but then it’s not clear to me what you believe about CIRL or what point you want to make with it.
(Ditto value learning, IRL, etc.)
I think this is straightforwardly true in two different ways:
Prior to the deep learning revolution, Eliezer didn’t predict that ANNs would be a big deal — he expected other, neither-GOFAI-nor-connectionist approaches to AI to be the ones that hit milestones like ‘solve Go’.
MIRI thinks the current DL paradigm isn’t alignable, so we made a bet on trying to come up with more alignable AI approaches (which we thought probably wouldn’t succeed, but considered high-enough-EV to be worth the attempt).
I don’t think this has anything to do with the OP, but I’m happy to talk about it in its own right. The most relevant thing would be if we’d lost a bet like ‘we predict deep learning will be too opaque to align’; but we’re still just as pessimistic about humanity’s ability to align deep nets as ever, so if you think we’ve hugely underestimated the tractability of aligning deep nets, I’d need to hear more about why. What’s the path to achieving astronomically good outcomes, on the assumption that the first AGI systems are produced by 2021-style ML methods?
Thanks, strong upvote, this is especially clarifying.
Firstly, I (partially?) agree that the current DL paradigm isn’t strongly alignable (in a robust, high-certainty sense); we may or may not agree on the extent to which it is approximately/weakly alignable.
The weakly alignable baseline should be “marginally better than humans”. Achieving that baseline as an MVP should be an emergency-level, high-priority civilizational project, even if the risk of doom from DL AGI is only 1% (and to be clear, I’m quite uncertain, but it’s probably considerably higher). Ideally we should always have an MVP alignment solution in place.
My thoughts on your last question are probably best expressed in a short post rather than a comment thread, but in summary:
DL methods are based on simple universal learning architectures (e.g. transformers, but AGI will probably be built on something even more powerful). The important properties of the resulting agents are thus much more a function of the data / training environment than of the architecture. You can rather easily limit an AGI’s power by constraining its environment. For example, we have nothing to fear from AGIs trained solely in Atari. We have much more to fear from agents trained by eating the internet. Boxing is stupid, but sim sandboxing is key.
As DL methods are already a success story in partial brain reverse engineering (explicitly so in DeepMind’s case), there’s hope for reverse engineering the circuits underlying empathy/love/altruism/etc. in humans—i.e. the approximate alignment solution that evolution found. We can then improve and iterate on that in simulations. I’m somewhat optimistic that it’s no more complex than other major brain systems we’ve already mostly reverse engineered.
The danger of course is that testing and iterating could use enormous resources, past the point where you already have a dangerous architecture that could be extracted. Nonetheless, I think this approach is much better than nothing, and amenable to (potentially amplified) iterative refinement.
I don’t know what “strongly alignable”, “robust, high certainty paradigm”, or “approximately/weakly alignable” mean here. As I said in another comment:
If you have enough time to work on the problem, I think basically any practical goal can be achieved in CS, including robustly aligning deep nets. The question in my mind is not ‘what’s possible in principle, given arbitrarily large amounts of time?’, but rather ‘what can we do in practice to actually end the acute risk period / ensure we don’t blow ourselves up in the immediate future?’.
(Where I’m imagining that you may have some number of years pre-AGI to steer toward relatively alignable approaches to AGI; and that once you get AGI, you have at most a few years to achieve some pivotal act that prevents AGI tech somewhere in the world from paperclipping the world.)
I don’t understand this part. If we had AGI that were merely as aligned as a human, I think that would immediately eliminate nearly all of the world’s existential risk. (Similarly, I think fast-running high-fidelity human emulations are one of the more plausible techs humanity could use to save the world, since you could then do a lot of scarily impressive intellectual work quickly (including work on the alignment problem) without putting massive work into cognitive transparency, oversight, etc.)
I’m taking for granted that AGI won’t be anywhere near as aligned as a human until long after either the world has been destroyed, or a pivotal act has occurred. So I’m thinking in terms of ‘what’s the least difficult-to-align act humanity could attempt with an AGI?’.
Maybe you mean something different by “marginally better than humans”?
I think this is a purely Problem #2 sort of research direction (‘we have subjective centuries to really nail down the full alignment problem’), not a Problem #1 research direction (‘we have a few months to a few years to do this one very concrete AI-developing-a-new-physical-technology task really well’).
For what it’s worth I’m cautiously optimistic that “reverse-engineering the circuits underlying empathy/love/altruism/etc.” is a realistic thing to do in years not decades, and can mostly be done in our current state of knowledge (i.e. before we have AGI-capable learning algorithms to play with—basically I think of AGI capabilities as largely involving learning algorithm development and empathy/whatnot as largely involving supervisory signals such as reward functions). I can share more details if you’re interested.
No I meant “merely as aligned as a human”. Which is why I used “approximately/weakly” aligned—as the system which mostly aligns humans to humans is imperfect and not what I would have assumed you meant as a full Problem #2 type solution.
Alright so now I’m guessing the crux is that you believe the DL based reverse engineered human empathy/altruism type solution I was alluding to—let’s just call that DLA—may take subjective centuries, which thus suggests that you believe:
That DLA is significantly more difficult than DL AGI in general
That uploading is likewise significantly more difficult
or perhaps
DLA isn’t necessarily super hard, but irrelevant because non-DL AGI (for which DLA isn’t effective) comes first
Is any of that right?
Sounds right, yeah!
So I can see how that is a reasonable interpretation of what you were expressing. However, given the opening framing, where you said you basically agreed with Eliezer’s pessimistic viewpoint that seems to dismiss most alignment research, I hope you can understand how I interpreted your saying “People haven’t tried very hard to find non-MIRI-ish approaches that might work” as dismissing ML-safety research like IRL, CIRL, etc.
I… think that makes more sense? Though Eliezer was saying the field’s progress overall was insufficient, not saying ‘decision theory good, ML bad’. He singled out e.g. Paul Christiano and Chris Olah as two of the field’s best researchers.
In any case, thanks for explaining!
For years they have consistently denied this, saying it’s a common misconception. See e.g. here. I am interested to hear your argument.
Interesting—it’s not clear to me how that dialogue addresses the common misconception.
My brief zero-effort counter-argument to that dialogue is: it’s hard to make rockets or airplanes safe without first mastering aerospace engineering.
So I think it’s super obvious that EY/MIRI/LW took the formalist side over connectivist, which I discuss more explicitly in the intro to my most recent 2021 post, which links to my 2015 post which discussed the closely connected ULM vs EM brain theories, which then links to my 2010 post discussing a half-baked connectivist alignment idea with some interesting early debate vs LW formalists (and also my successful prediction of first computer Go champion 5 years in advance).
So I’ve been here a while, and I even had a number of conversations with MIRI’s two-person ML-but-not-DL alignment group (Jessica & Jack) when that was briefly a thing, and it would be extremely ambitious revisionist history to claim that EY/MIRI didn’t implicitly if not explicitly bet against connectivism.
So that’s why I asked Rob about point 13 above—as it seems unjustifiably dismissive of the now-dominant, connectivist-friendly alignment research (and said dismissal substantiates my point).
But I’m not here to get in some protracted argument about this. So why am I here? Because the event horizons of phyg attractors are obvious historical Schelling points for meeting other interesting people. Speaking of which, we should chat—I really liked your Birds, Brains, Planes post in particular; I actually wrote up something quite similar a while ago.
Thanks! What’s a phyg attractor? Google turns up nothing.
To say a bit more about my skepticism—there are various reasons why one might want to focus on agent foundations and similar stuff even if you also think that deep learning is about to boom and be super effective and profitable. For example, you might think the deep-learning-based stuff is super hard to align relative to other paradigms. Or you might think that we won’t be able to align it until we are less confused about fundamental issues, and the way to deconfuse ourselves is to think in formal abstractions rather than messing around with big neural nets. Or you might think that both ways are viable but the formal abstraction route is relatively neglected. So the fact that MIRI bet on agent foundations stuff doesn’t seem like strong evidence that they were surprised by the deep learning boom, or at least, more surprised than their typical contemporaries.
Skepticism of what?
Like I said in the parent comment—investing in AF can be a good bet, even if it’s low probability of success. And I mostly agree with your rationalizations there, but they are post-hoc. I challenge you to find early evidence (ideally 2010 or earlier—for reasons explained in a moment) documenting that MIRI leaders “also think that deep learning is about to boom and be super effective and profitable”.
The connectivist-futurists (Moravec/Kurzweil) were already predicting a timeline for AGI in the 2020s through brain reverse engineering. EY/MIRI implicitly/explicitly critiqued that and literally invested time/money/resources in hiring/training up people (a whole community, arguably!) in knowledge/beliefs very different from—and mostly useless for understanding—the connectivist/DL path to AGI.
So if you truly believed in 2010, after hearing a recent neuroscience PhD’s first presentation on how they were going to reverse engineer the brain (DeepMind), and you actually gave that even a 50% chance of success—do you truly believe it would be wise to invest the way MIRI did? And to be hostile to connectivist/DL approaches, as they still are? Do you not think they at least burned some bridges? Have you seen EY’s recent thread, where he attempts a blatant revisionist-history critique of Moravec? (Moravec actually claimed AGI around 2028, not 2010, which seems surprisingly on-track prescient to me now in 2021.)
Again, quoting Rob from above:
Which I read as dismissing the DL-friendly alignment research tracks: IRL/CIRL/value learning, etc. And EY explicitly dismisses most alignment research in some other recent thread.
I don’t know what to believe yet; I appreciate the evidence you are giving here (in particular your experience as someone who has been around in the community longer than me). My skepticism was about the inference from MIRI did abstract AF research --> MIRI thought deep learning would be much less effective than it in fact was.
I do remember reading some old posts from EY about connectionism that suggest that he at least failed to predict the deep learning boom in advance. That’s different from confidently predicting it wouldn’t happen, though.
I too think that Moravec et al. deserve praise for successfully predicting the deep learning boom and having accurate AI timelines 40 years in advance.
Old LessWrong meme—phyg is rot13 cult. For a while people were making “are we a cult” posts so much that it was actually messing with LessWrong’s SEO. Hence phyg.
Ask google what LW is—ie just start typing lesswrong or “lesswrong is a” and see the auto-complete. Using the word ‘phyg’ is an LW community norm attempt to re-train google.
I don’t think alignment is “just a technical problem” in any domain, because:
I don’t think there’s a good enough definition of “alignment” for it to be addressed in any technical way.
Saying that “being aligned” means “behaving according to human values” just throws it back to the question of how exactly you define what “human values” are. Are they what humans say they want? What humans actually do? What humans would say they wanted if they knew the results (and with what degree of certainty required)? What would make humans actually happiest (and don’t forget to define “happy”)? The extrapolated volition of humans under iterated enhancement (which, in addition to being uncomputable, is probably both dependent on initial conditions and path-dependent, with no particular justification for preferring one path over another at any given step)?
Insofar as there are at least some vague ideas of what “alignment” or “human values” might mean, treating alignment as a technical problem would require those values to have a lot more coherence than they actually seem to have.
If you ask a human to justify its behavior at time X, the human will state a value V_x. If you ask the same human to justify some other behavior at time Y, you’ll get a value V_y. V_x and V_y will often be mutually contradictory. You don’t have a technical problem until you have a philosophically satisfying way of resolving that contradiction, which is not just a technical issue. Yet at the same time there’s feedback pressure on that philosophical decision, because some resolutions might be a lot more technically implementable than others.
Even if individual humans had coherent values, there’s every reason to think that the individuals in groups don’t share those values, and absolutely no reason at all to think that groups will converge under any reasonable kind of extrapolation. So now you have a second philosophical-problem-with-technical-feedback, namely reconciling multiple people’s contradictory values. That’s a problem with thousands of years of history, by the way, and nobody’s managed to reduce it to the technical yet, even though the idea has occurred to people.
Then you get to the technical issues, which are likely to be very hard and may not be solvable within physical limits. But it’s not remotely a technical problem yet.
It is possible to define the alignment problem without using such fuzzy concepts as “happiness” or “value”.
For example, there are two agents: R and H. The agent R can do some actions.
The agent H prefers some of R’s actions over other actions. For example, H prefers the action make_pie to the action kill_all_humans.
Some of the preferences are unknown even to H itself (e.g. whether it prefers pierogi to borscht).
Among other things, the set of R’s actions includes:
ask_h_which_of_the_actions_is_preferable
infer_preferences_from_the_behavior_of_h
explain_consequences_of_the_action_to_h
switch_itself_off
In any given situation, the perfect agent R always chooses the most preferable action (according to H). The goal is to create an agent that is as close to the perfect R as possible.
Of course, this formalism is incomplete. But I think it demonstrates that the alignment problem can be framed as a technical problem without delving into metaphysics.
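Half-baked or not, the sketch is concrete enough to write down as toy code. Here is a minimal illustrative sketch (the preference scores are my own invented assumptions, not part of any real system): H’s preferences are modeled as a score per action, the “perfect” R maximizes H’s true scores, and a realizable R maximizes an imperfect estimate of them.

```python
# Toy sketch of the R/H formalism above. All scores are illustrative
# assumptions; in the real problem H's scores are unknown, and partly
# unknown even to H itself.

ACTIONS = [
    "make_pie",
    "kill_all_humans",
    "ask_h_which_of_the_actions_is_preferable",
    "infer_preferences_from_the_behavior_of_h",
    "explain_consequences_of_the_action_to_h",
    "switch_itself_off",
]

# H's true preference ordering, as a score per action (higher = preferred).
H_TRUE_SCORES = {
    "make_pie": 10,
    "ask_h_which_of_the_actions_is_preferable": 5,
    "explain_consequences_of_the_action_to_h": 5,
    "infer_preferences_from_the_behavior_of_h": 4,
    "switch_itself_off": 0,
    "kill_all_humans": -1000,
}

def perfect_r(available_actions):
    """The idealized R: always chooses the action H most prefers."""
    return max(available_actions, key=lambda a: H_TRUE_SCORES[a])

def approximate_r(available_actions, estimated_scores):
    """A realizable R: chooses using an imperfect estimate of H's scores."""
    return max(available_actions, key=lambda a: estimated_scores.get(a, 0))

# The perfect R bakes the pie.
assert perfect_r(ACTIONS) == "make_pie"
```

The gap between perfect_r and approximate_r is where all the difficulty lives: the approximate agent is only as good as its estimate of H’s scores, which is exactly the part the formalism leaves open.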
If you replace “value” with “preference” in what I wrote, I believe that it all still applies.
If you both “ask H about the preferable action” and “infer H’s preferences from the behavior of H”, then what do you do when the two yield different answers? That’s not a technical question; you could technically choose either one or even try to “average” them somehow. And it will happen.
The same applies if you have to deal with two humans, H1 and H2; they are sometimes going to disagree. How do you choose then?
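To make concrete how non-technical that choice is, here is a hypothetical sketch (the actions, scores, and weighting rule are all my own assumptions) in which the stated-versus-inferred trade-off is reduced to a single weight w. Any value of w, and indeed the linear rule itself, is a philosophical commitment rather than a technical result:

```python
def combined_score(action, stated, inferred, w):
    """Blend H's stated and behaviorally inferred preference scores.
    The weight w is the whole problem in miniature: choosing any value
    is a philosophical decision, not a technical one."""
    return w * stated.get(action, 0) + (1 - w) * inferred.get(action, 0)

def choose(actions, stated, inferred, w):
    """Pick the action with the highest blended score."""
    return max(actions, key=lambda a: combined_score(a, stated, inferred, w))

# H says they prefer salad, but their behavior suggests cake.
stated   = {"serve_salad": 8, "serve_cake": 2}
inferred = {"serve_salad": 1, "serve_cake": 9}

# The "right" answer flips with w, and nothing in the formalism picks w.
assert choose(["serve_salad", "serve_cake"], stated, inferred, w=0.9) == "serve_salad"
assert choose(["serve_salad", "serve_cake"], stated, inferred, w=0.1) == "serve_cake"
```

The same structure reappears for two humans: replace the stated and inferred scores with H1’s and H2’s, and the weight becomes the question of whose preferences count for how much.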
There are also technical problems with both of those, and they’re the kind of technical problems I was talking about that feed back on the philosophical choices. You might start with one philosophical position, then want to change when you saw the technical results.
For the first:
It assumes that H’s “real” preferences comport with what H says. That isn’t a given, because “preference” is just as hard to define as “value”. Choosing to ask H really amounts to defining preference to mean “stated preference”.
It also assumes that H will always be able to state a preference, will be able to do so in a way that you can correctly understand, and will not be unduly ambivalent about it.
You’d probably also prefer that H (or somebody else...) not regret that preference if it gets enacted. You’d probably like to have some ability to predict that H is going to get unintended consequences, and at least give H more information before going ahead. That’s an extra feature not implied by a technical specification based on just doing whatever H says.
Related to (3), it assumes that H can usefully state preferences about courses of action more complicated than H could plan, when the consequences themselves may be more complicated than H can understand. And you yourself may have very complicated forms of uncertainty about those consequences, which makes it all the harder to explain the whole thing to H.
All of that is pretty unlikely.
The second is worse:
It assumes that that H’s actions always reflect H’s preferences, which amounts to adopting a different definition of “preference”, probably even further from the common meaning.
H’s preferences aren’t required to be any simpler or more regular than a list of every possible individual situation, with a preferred course of action for each one, independent of all the others. For that matter, the list is allowed to change, or be dependent on when some particular circumstances occur, or include “never do the same thing twice in the same circumstances”. Even if H’s behavior is assumed to reflect H’s preferences, there’s still nothing that says H has to have an inferable set of preferences.
To make inferences about H’s preferences, you first have to make a leap of faith and assume that they’re simple enough, compact enough, and consistent enough to be “closely enough” approximated by any set of rules you can infer. That is a non-technical leap of faith. And there’s a very good chance that it would be the wrong leap to make.
It assumes that the rules you can infer from H’s behavior are reasonably prescriptive about the choices you might have to make. Your action space may be far beyond anything H could do, and the choices you have to make may be far beyond anything H could understand.
So you end up taking a bunch of at best approximate inferences about H’s existing preferences, and trying to use them to figure out “What would H do if H were not a human, but in fact some kind of superhuman AGI totally unlike a human, but were somehow still H?”. That’s probably not a reasonable question to ask.
Oh, one more thing I should probably add: it gets even more interesting when you ask whether the AGI might act to change the human’s values (or preferences; there’s really no difference, and both are equally “fuzzy” concepts). Any action that affects the human at all is likely to have some effect on them, and some actions could be targeted to have very large effects.
I agree, you’ve listed some very valid concerns about my half-baked formalism.
As I see it, the first step in solving the alignment problem is to create a good formalism without delving into metaphysics.
The formalism doesn’t have to be perfect. If our theoretical R makes its decisions according to the best possible approximate inferences about H’s existing preferences, then R is much better than rogue AGI, even if it sometimes makes deadly mistakes. Any improvement over rogue AGI is a good improvement.
Compare: the Tesla AI sometimes causes deadly crashes. Yet the Tesla AI is much better than the status quo, as its net effect is thousands of saved lives.
And after we have a decent formalism, we can build a better formalism from it, and then repeat and repeat.
Nobody’s even gotten close to metaphysics. Ethics or even epistemology, OK. Metaphysics, no. The reason I’m getting pedantic about the technical meaning of the word is that “metaphysics”, when used non-technically, is often a tag word used for “all that complicated, badly-understood stuff that might interfere with bulling ahead”.
My narrow point is that alignment isn’t a technical problem until you already have an adequate final formalism. Creating the formalism itself isn’t an entirely technical process.
If you’re talking about inferring, learning, being instructed about, or actually carrying out human preferences, values, or paths to a “good outcome”, then as far as I know nobody has an approximately adequate formalism, and nobody has a formalism with any clear path to be extended to adequacy, or even any clear hope of it. I’ve seen proposals, but none of them have stood up to 15 minutes of thought. I don’t follow it all the time; maybe I’ve missed something.
In fact, even asking for an “adequate” formalism is putting the cart before the horse, because nobody even has a set of reasonable meta-criteria to use to evaluate whether any given formalism is fit for use. There’s no clear statement of what that would mean.
My broader concern is that I’m unsure an adequate list of meta-criteria can be established, and that I’m even less sure that the base formalism can exist at all. Demanding a formal system that can’t be achieved can lead to all kinds of bad outcomes, many of them related to erroneously pretending that a formalism you have usefully approximates the formalism you need.
It would be very easy to decide that, for the sake of “avoiding metaphysics”, it was important to adopt, agree upon, and stay within a certain framework—one that did not meet meta-criteria like “allows you to express constraints that assure that everybody doesn’t end up worse than dead”, let alone “allows you to express what it means to achieve the maximum benefit from AGI”, or “must provide prescriptions implementable in actual software”.
Oh, people would keep tweaking any given framework to cover more edge cases, and squeeze more and more looseness out of some definitions, and play around with more and more elegant statements of the whole thing… but that could just be a nice distraction from the fundamental lack of any “no fate worse than death” guarantee anywhere in it.
A useful formalism does have to be perfect in achieving no fates worse than death, or no widespread fates worse than death. It has to define fates worse than death in a meaningful way that doesn’t sacrifice the motivation for having the constraint in the first place. It has to achieve that over all possible fates worse than death, including ones nobody has thought of yet. It has to let you at least approximately exclude at least the widespread occurrence of anything that almost anybody would think was a fate worse than death. Ideally while also enabling you to actually get positive benefits from your AGI.
And formal frameworks are often brittle; a formalism that doesn’t guarantee perfection does not necessarily even avert catastrophe. If you make a small mistake in defining “fate worse than death”, that may lead to a very large prevalence of the case you missed.
It’s not even true that “the best possible inferences” are necessarily better than nothing, let alone adequate in any absolute sense. In fact, a truly rogue AGI that doesn’t care about you at all seems more likely to just kill you quickly, whereas who knows what a buggy AGI that was interested in your fate might choose to do...
The very adoption of the word “alignment” seems to be a symptom of a desire to at least appear to move toward formalizing, without the change actually tending to improve the chances of a good outcome. I think people were trying to tighten up from “good outcome” when they adopted “alignment”, but actually I don’t think it helped. The connotations of the word “alignment” tend to concentrate attention on approaches that rely on humans to know what they want, or at least to have coherent desires, which isn’t necessarily a good idea at all. On the other hand, the switch doesn’t actually seem to make it any easier to design formal structures or technical approaches that will actually lead to good software behavior. It’s still vague in all the ways that matter, and it doesn’t seem to be improving at all.
We could use the Tesla AI as a model.
To create a perfect AI for self-driving, one first must resolve all that complicated, badly-understood stuff that might interfere with bulling ahead. For example, if the car should prefer the driver’s life over the pedestrian’s life.
But while we contemplate such questions, we lose tens of thousands of lives in car crashes per year.
The people at Tesla made the rational decision to bull ahead instead. As their AI is not perfect, it sometimes makes decisions with deadly consequences. But in total, it saves lives.
Their AI has an imperfect but good-enough formalism. AFAIK, it’s something that could be described in English as “drive to the destination without breaking the driving regulations, while minimizing the number of crashes”.
As their AI is net saving lives, their formalism is indeed good enough. They have successfully reduced a complex ethical/societal problem to a purely technical problem.