EDIT: This comment fails on a lot of points, as discussed in this apology subcomment. I encourage people interested in the thread to mostly read the apology subcomment and the list of comments linked there, which provide maximum value with minimum drama IMO.
Disclaimer: this is a rant. In the best possible world, I could write from a calmer place, but I’m pretty sure I can only push through the taboo on criticizing MIRI and EY too hard on the AF when I’m annoyed enough. That being said, I’m writing down thoughts that I had for quite some time, so don’t just discard this as a gut reaction to the post.
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Tl;dr:
I’m annoyed by EY’s (and maybe MIRI’s?) dismissal of all other alignment work, and by how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment, and of taking a lot of time to let go of these flawed agendas in the face of mounting evidence.
I’m annoyed that I have to deal with the nerd-sniping introduced by MIRI when bringing new people to the field, especially given the status imbalance.
I’m sad that EY and MIRI’s response to their research agenda not being as promising as they wanted is “we’re all doomed”.
Honestly, I really, really tried to find how MIRI’s Agent Foundations agenda was supposed to help with alignment. I really did. Some people tried to explain it to me. And I wanted to believe, because logic and maths are amazing tools with which to attack this most important problem, alignment. Yet I can’t escape the fact that the only contributions to technical alignment I can see from MIRI have been done by a bunch of people who mostly do their own thing instead of following MIRI’s core research program: Evan, Vanessa, Abram and Scott. (Note that this is my own judgement, and I haven’t talked to these people about this comment, so if you disagree with me, please don’t go at them.)
All the rest, including some of the things these people worked on (but not most of it), is nerd-sniping as far as I’m concerned. It’s a tricky failure mode because it looks like good and serious research to the AF and LW audience. But there’s still a massive difference between actually trying to solve the real problems related to alignment, with all the tools at our disposal, and deciding that the focus should be on a handful of toy problems neatly expressed with decision theory and logic, and then working only on those.
That’s already bad enough. But then we have posts like this one, where EY just dunks on everyone working on alignment as fakers, or as having terrible ideas. And at that point, I really wonder: why is that supposed to be an argument from authority anymore? Yes, I give massive credibility points to EY and MIRI for starting the field of alignment almost by themselves, and for articulating a lot of the issues. Yet all of the work that looked actually pushed by the core MIRI team (and based on some of EY’s work) from MIRI’s beginning is just toying with logic problems, with hardly any connection to alignment here and there. (I know they’re not publishing most of it, but that judgment applies to their currently published work, and from the agenda and the blog posts, it sounds like most of the unpublished work was definitely along those lines.) Similarly, the fact that they kept at it over and over through all the big improvements in DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be over-attached to a specific framing, which they had trouble discarding.
Note that this also has massive downsides for conceptual alignment in general, because when bringing people in, you have to deal with this specter of nerd-sniping by the founding lab of the field, which is still a figure of authority. I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.
When I’m not frustrated by this situation, I’m just sad. Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t so much switch to something else as declare that everyone was still full of it, and that because they had no idea how to solve the problem at the moment, it was doomed.
What could be done differently? Well, I would really, really like it if EY and other MIRI people who are very dubious of most alignment research could give more feedback on it and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work; it’s with the disagreement stopping at “that’s not going to work”, without any dialogue and back and forth.
Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then being like “fuck it, let’s just find a new way of grappling with it”. That’s really what saddens me the most about all of this: I feel that some of the best current minds who care about alignment have sort of given up on actually trying to solve it.
This is an apology for the tone and the framing of the above comment (and my following answers), which have both been needlessly aggressive, status-focused and uncharitable. Underneath are still issues that matter a lot to me, but others have discussed them better (I’ll provide a list of linked comments at the end of this one).
Thanks to Richard Ngo for convincing me that I actually needed to write such an apology, which was probably the needed push for me to stop weaseling around it.
So what did I do wrong? The list is pretty damning:
I took something about the original post that I didn’t understand — EY’s “And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.” — and because it didn’t make sense to me, and because that fitted with my stereotypes of MIRI and EY’s dismissiveness of a lot of work in alignment, I jumped to interpreting it as an attack on alignment researchers, saying they were consciously faking it when they knew they should do better. Whereas I now feel that what EY meant is far closer to “alignment research at the moment is trying to try to align AI as best as we can, instead of just trying to do it”. I’m still not sure if I agree with that characterization, but that sounds far more like something that can be discussed.
There’s also a weird aspect of status-criticism to my comment that I think I completely failed to explain. Looking at my motives now (let’s be wary of hindsight...), I feel like my issue with the status thing was more that a bunch of people other than EY and MIRI just take what they say as super strong evidence without looking at all the arguments and details, and thus I expected this post and recent MIRI publications to create a background of “we’re doomed” for a lot of casual observers, with the force of the status of EY and MIRI. But I don’t want to say that EY and MIRI are given too much status in general in the community, even if I actually wrote something along those lines. I guess it’s just easier to focus your criticism on the beacon of status than on the invisible crowd misusing status. Sorry about that.
I somehow turned that into an attack on MIRI’s research (at least a chunk of it), which didn’t really have anything to do with it. That probably was just the manifestation of my frustration when people come to the field and feel like they shouldn’t do the experimental research that they feel better suited for, or feel like they need to learn a lot of advanced maths. Even if those are not official MIRI positions, I definitely feel MIRI has had a big influence on them. And yet, maybe newcomers should question themselves that way. It always sounded like a loss of potential to me, because the outcome is often to not do alignment; but maybe even if you’re into experiments, the best way you could align AIs now doesn’t go through that path (and you could still find that exciting enough to do new research). Whatever the correct answer is, my weird ad-hominem attack has nothing to do with it, so I apologize for attacking all of MIRI’s research and their choice of research agendas with it (even if I think talking more about what is and was the right choice still matters).
Part of my failure here has also been not accounting for the fact that aggressive writing just feels snappier without much effort. I still think my paragraph starting with “When I’m not frustrated by this situation, I’m just sad.” works pretty well as an independent piece of writing, but it’s obviously needlessly aggressive and spicy, and doesn’t leave any room for the doubt that I actually felt or the doubts I should have felt. My answers after that comment are better, but still riding too much on that tone.
One of the saddest failures (pointed out to me by Richard) is that by my tone and my presentation, I made it harder and more aversive for MIRI and EY to share their models, because they have to fear that kind of reaction a bit more. And even if Rob reacted really nicely, I expect that required a bunch of additional mental energy that a better comment wouldn’t have asked for. So I apologize for that, and really want more model-building and discussions from MIRI and EY publicly.
So in summary, my comment should have been something along the lines of “Hey, I don’t understand what your generators are for saying that all alignment research is ‘mostly fake or pointless or predictable’; could you give me some pointers to that?”. I wasn’t in the headspace, and didn’t have the right handles, to frame it that way and not go into weirdly aggressive tangents, and that’s on me.
On the plus side, every other comment on the thread has been high-quality and thoughtful, so here’s a list of the best ones IMO:
Ben Pace’s comment on what success stories for alignment would look like, giving examples.
Rob Bensinger’s comment about the directions of prosaic alignment I wrote I was excited about, and whether they’re “moving the dial”.
Rohin Shah’s comment, which frames the outside view of MIRI I was pointing at better than I did, and without the aggressiveness.
John Wentworth’s two comments about the generators of EY’s pessimism being in the sequences all along.
Vaniver’s comment presenting an analysis of why some concrete ML work in alignment doesn’t seem to help for the AGI level.
Rob Bensinger’s comment drawing a great list of distinctions to clarify the debate.
Similarly, the fact that they kept at it over and over through all the big improvements in DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be over-attached to a specific framing, which they had trouble discarding.
I’m… confused by this framing? Specifically, this bit (as well as other bits like these)
I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.
Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t so much switch to something else as declare that everyone was still full of it
Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then being like “fuck it, let’s just find a new way of grappling with it”.
seem to be coming at the problem with [something like] a baked-in assumption that prosaic alignment is something that Actually Has A Chance Of Working?
And, like, to be clear, obviously if you’re working on prosaic alignment that’s going to be something you believe[1]. But it seems clear to me that EY/MIRI does not share this viewpoint, and all the disagreements you have regarding their treatment of other avenues of research seem to me to be logically downstream of this disagreement?
I mean, it’s possible I’m misinterpreting you here. But you’re saying things that (from my perspective) only make sense with the background assumption that “there’s more than one game in town”—things like “I wish EY/MIRI would spend more time engaging with other frames” and “I don’t like how they treat lack of progress in their frame as evidence that all other frames are similarly doomed”—and I feel like all of those arguments simply fail in the world where prosaic alignment is Actually Just Doomed, all the other frames Actually Just Go Nowhere, and conceptual alignment work of the MIRI variety is (more or less) The Only Game In Town.
To be clear: I’m pretty sure you don’t believe we live in that world. But I don’t think you can just export arguments from the world you think we live in to the world EY/MIRI thinks we live in; there needs to be a bridging step first, where you argue about which world we actually live in. I don’t think it makes sense to try and highlight the drawbacks of someone’s approach when they don’t share the same background premises as you, and the background premises they do hold imply a substantially different set of priorities and concerns.
Another thing it occurs to me your frustration could be about is the fact that you can’t actually argue this with EY/MIRI directly, because they don’t frequently make themselves available to discuss things. And if something like that’s the case, then I guess what I want to say is… I sympathize with you abstractly, but I think your efforts are misdirected? It’s okay for you and other alignment researchers to have different background premises from MIRI or even each other, and for you and those other researchers to be working on largely separate agendas as a result? I want to say that’s kind of what foundational research work looks like, in a field where (to a first approximation) nobody has any idea what the fuck they’re doing?
And yes, in the end [assuming somebody succeeds] that will likely mean that a bunch of people’s research directions were ultimately irrelevant. Most people, even. That’s… kind of unavoidable? And also not really the point, because you can’t know which line of research will be successful in advance, so all you have to go on is your best guess, which… may or may not be the same as somebody else’s best guess?
I dunno. I’m trying not to come across as too aggressive here, which is why I’m hedging so many of my claims. To some extent I feel uncomfortable trying to “police” people’s thoughts here, since I’m not actually an alignment researcher… but also it felt to me like your comment was trying to police people’s thoughts, and I don’t actually approve of that either, so...
Yeah. Take this how you will.
[1] I personally am (relatively) agnostic on this question, but as a non-expert in the field my opinion should matter relatively little; I mention this merely as a disclaimer that I am not necessarily on board with EY/MIRI about the doomed-ness of prosaic alignment.
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Okay, so you’re completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is Impossible or not. But I feel like you don’t get how one-sided this debate is, and how misrepresented it is here (and generally on the AF).
Like, nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with Neural Nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.
But that’s just a majority argument. The real problem is that nobody has ever given a good argument for why this is impossible. I mean, the analogous situation is that a car is driving right at you, accelerating, and you’ve decided somehow that it’s impossible to ever stop it before it kills you. You need a very strong case before giving up like that. And that case has not been given by EY and MIRI AFAIK.
The last part of this is that because EY and MIRI founded the field, their view is given far more credibility than what it would have on the basis of the arguments alone, and far more than it has in actual discussions between researchers.
The best analogy I can find (a bit strawmanish, but less than you would expect) is a world where somehow the people who had founded the study of cancer had the idea that no method based on biological experimentation and thinking about cells could ever cure cancer, and that the only way of solving it was to understand all the dynamics in a very advanced category-theoretic model. Then, having found the latter really hard, they just say that curing cancer is impossible.
I think one core issue here is that there are actually two debates going on. One is “how hard is the alignment problem?”; another is “how powerful are prosaic alignment techniques?” Broadly speaking, I’d characterise most of the disagreement as being on the first question. But you’re treating it like it’s mostly on the second question—like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.
My attempt to portray EY’s perspective is more like: he’s concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he’s trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.
Regardless of that, calling the debate “one sided” seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem “one sided” − 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily. But for fairly similar meta-level reasons as why it’s good for them to listen to us in an open-minded way, it’s also good for prosaic alignment researchers to listen to EY in an open-minded way. (As a side note, I’d be curious what credence you place on EY’s worldview being more true than the prosaic alignment worldview.)
Now, your complaint might be that MIRI has not made their case enough over the last few years. If that’s the main issue, then stay tuned; as Rob said, this is just the preface to a bunch of relevant material.
95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily
The 2016 survey of people in AI asked people about the alignment problem as described by Stuart Russell, and 39% said it was an important problem and 33% that it’s a harder problem than most other problems in the field.
I think one core issue here is that there are actually two debates going on. One is “how hard is the alignment problem?”; another is “how powerful are prosaic alignment techniques?” Broadly speaking, I’d characterise most of the disagreement as being on the first question. But you’re treating it like it’s mostly on the second question—like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.
That’s an interesting separation of the problem, because I really feel there is more disagreement on the second question than on the first.
My attempt to portray EY’s perspective is more like: he’s concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he’s trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.
Funnily, aren’t the people currently working on ageing using quite prosaic techniques? I completely agree that one needs to go for the big problems, especially ones that only appear in more powerful regimes (which is why I am adamant that there should be places for researchers to think about distinctly AGI problems and not have to rephrase everything in a way that is palatable to ML academia). But people like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies. So I have trouble understanding how prosaic alignment isn’t trying to solve the problem at all. Maybe it’s just a disagreement on how large the “prosaic alignment” category is?
Regardless of that, calling the debate “one sided” seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem “one sided” − 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily. But for fairly similar meta-level reasons as why it’s good for them to listen to us in an open-minded way, it’s also good for prosaic alignment researchers to listen to EY in an open-minded way.
You definitely have a point, and I want to listen to EY in an open-minded way. It’s just harder when he writes things like everyone working on alignment is faking it, without giving much detail. Also, I feel that your comparison breaks a bit because, compared to the debate with ML researchers (where most people against alignment haven’t even thought about the basics and make obvious mistakes), the other parties in this debate have thought long and hard about alignment. Maybe not as much as EY, but clearly much more than the ML researchers in the whole “is alignment even a problem” debate.
(As a side note, I’d be curious what credence you place on EY’s worldview being more true than the prosaic alignment worldview.)
At the moment I feel like I don’t have a good enough model of EY’s worldview, plus I’m annoyed by his statements, so any credence I give now would be biased against his worldview.
Now, your complaint might be that MIRI has not made their case enough over the last few years. If that’s the main issue, then stay tuned; as Rob said, this is just the preface to a bunch of relevant material.
I really feel there is more disagreement on the second question than on the first
What is this feeling based on? One way we could measure this is by asking people about how much AI xrisk there is conditional on there being no more research explicitly aimed at aligning AGIs. I expect that different people would give very different predictions.
People like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies.
Everyone agrees that Paul is trying to solve foundational problems. And it seems strange to criticise Eliezer’s position by citing the work of MIRI employees.
It’s just harder when he writes things like everyone working on alignment is faking it, without giving much detail.
I worry that “Prosaic Alignment Is Doomed” seems a bit… off as the most appropriate crux. At least for me. It seems hard for someone to justifiably know that this is true with enough confidence to not even try anymore. To have essayed or otherwise precluded all promising paths of inquiry, to not even engage with the rest of the field, to not even try to argue other researchers out of their mistaken beliefs, because it’s all Hopeless.
Consider the following analogy: Someone who wants to gain muscle, but has thought a lot about nutrition and their genetic makeup and concluded that Direct Exercise Gains Are Doomed, and they should expend their energy elsewhere.
OK, maybe. But how about try going to the gym for a month anyways and see what happens?
The point isn’t “EY hasn’t spent a month of work thinking about prosaic alignment.” The point is that AFAICT, by MIRI/EY’s own values, valuable-seeming plans are being left to rot on the cutting room floor. Like, “core MIRI staff meet for an hour each month and attack corrigibility/deceptive cognition/etc with all they’ve got. They pay someone to transcribe the session and post the fruits / negative results / reasoning to AF, without individually committing to following up with comments.”
(I am excited by Rob Bensinger’s comment that this post is the start of more communication from MIRI)
Like, nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with Neural Nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.
I don’t get the impression that Eliezer’s saying that alignment of prosaic AI is impossible. I think he’s saying “it’s almost certainly not going to happen because humans are bad at things.” That seems compatible with “every other researcher I know thinks Prosaic Alignment is possible, even if potentially very hard” (if you go with the “very hard” part).
Yes, +1 to this; I think it’s important to distinguish between impossible (which is a term I carefully avoided using in my earlier comment, precisely because of its theoretical implications) and doomed (which I think of as a conjunction of theoretical considerations—how hard is this problem?--and social/coordination ones—how likely is it that humans will have solved this problem before solving AGI?).
I currently view this as consistent with e.g. Eliezer’s claim that Chris Olah’s work, though potentially on a pathway to something important, is probably going to accomplish “far too little far too late”. I certainly didn’t read it as anything like an unconditional endorsement of Chris’ work, as e.g. this comment seems to imply.
Ditto—the first half makes it clear that any strategy which isn’t at most 2 years slower than an unaligned approach will be useless, and that prosaic AI safety falls into that bucket.
Thanks for elaborating. I don’t think I have the necessary familiarity with the alignment research community to assess your characterization of the situation, but I appreciate your willingness to raise potentially unpopular hypotheses to attention. +1
+1 for this whole conversation, including Adam pushing back re prosaic alignment / trying to articulate disagreements! I agree that this is an important thing to talk about more.
I like the ‘give more concrete feedback on specific research directions’ idea, especially if it helps clarify generators for Eliezer’s pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you’re simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.
From my perspective, the OP discussion is the opening salvo in ‘MIRI does a lot more model-sharing and discussion’. It’s more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we’re thinking about AGI, etc. In the meantime, I’m strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.
Someone privately contacted me to express confusion, because they thought my ‘+1’ meant that I think adamShimi’s initial comment was unusually great. That’s not the case. The reasons I commented positively are:
I think this overall exchange went well—it raised good points that might have otherwise been neglected, and everyone quickly reached agreement about the real crux.
I want to try to cancel out any impression that criticizing / pushing back on Eliezer-stuff is unwelcome, since Adam expressed worries about a “taboo on criticizing MIRI and EY too hard”.
On a more abstract level, I like seeing people ‘blurt out what they’re actually thinking’ (if done with enough restraint and willingness-to-update to mostly avoid demon threads), even if I disagree with the content of their thought. I think disagreements are often tied up in emotions, or pattern-recognition, or intuitive senses of ‘what a person/group/forum is like’. This can make it harder to epistemically converge about tough topics, because there’s a temptation to pretend your cruxes are more simple and legible than they really are, and end up talking about non-cruxy things.
Separately, I endorse Ben Pace’s question (“Can you make a positive case here for how the work being done on prosaic alignment leads to success?”) as the thing to focus on.
Thanks for the kind answer, even if we’re probably disagreeing about most points in this thread. I think messages like yours really help in making everyone aware that such topics can actually be discussed publicly without a big backlash.
I like the ‘give more concrete feedback on specific research directions’ idea, especially if it helps clarify generators for Eliezer’s pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you’re simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.
That sounds amazing! I definitely want to extract some of the epistemic strategies that EY uses to generate criticisms and break proposals. :)
From my perspective, the OP discussion is the opening salvo in ‘MIRI does a lot more model-sharing and discussion’. It’s more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we’re thinking about AGI, etc. In the meantime, I’m strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.
Some things that seem important to distinguish here:
‘Prosaic alignment is doomed’. I parse this as: ‘Aligning AGI, without coming up with any fundamentally new ideas about AGI/intelligence or discovering any big “unknown unknowns” about AGI/intelligence, is doomed.’
I (and my Eliezer-model) endorse this, in large part because ML (as practiced today) produces such opaque and uninterpretable models. My sense is that Eliezer’s hopes largely route through understanding AGI systems’ internals better, rather than coming up with cleverer ways to apply external pressures to a black box.
‘All alignment work that involves running experiments on deep nets is doomed’.
My Eliezer-model doesn’t endorse this at all.
Also important to distinguish, IMO (making up the names here):
A strong ‘prosaic AGI’ thesis, like ‘AGI will just be GPT-n or some other scaled-up version of current systems’. Eliezer is extremely skeptical of this.
A weak ‘prosaic AGI’ thesis, like ‘AGI will involve coming up with new techniques, but the path between here and AGI won’t involve any fundamental paradigm shifts and won’t involve us learning any new deep things about intelligence’. I’m not sure what Eliezer’s unconditional view on this is, but I’d guess that he thinks this falls a lot in probability if we condition on something like ‘good outcomes are possible’—it’s very bad news.
An ‘unprosaic but not radically different AGI’ thesis, like ‘AGI might involve new paradigm shifts and/or new deep insights into intelligence, but it will still be similar enough to the current deep learning paradigm that we can potentially learn important stuff about alignable AGI by working with deep nets today’. I don’t think Eliezer has a strong view on this, though I observe that he thinks some of the most useful stuff humanity can do today is ‘run various alignment experiments on deep nets’.
An ‘AGI won’t be GOFAI’ thesis. Eliezer strongly endorses this.
There’s also an ‘inevitability thesis’ that I think is a crux here: my Eliezer-model thinks there are a wide variety of ways to build AGI that are very different, such that it matters a lot which option we steer toward (and various kinds of ‘prosaicness’ might be one parameter we can intervene on, rather than being a constant). My Paul-model has the opposite view, and endorses some version of inevitability.
Your comment and Vaniver’s (paraphrasing) “not surprised by the results of this work, so why do it?” were especially helpful. EY (or others) assessing concrete research directions with detailed explanations would be even more helpful.
I agree with Rohin’s general question of “Can you tell a story where your research helps solve a specific alignment problem?”, and if you have other heuristics when assessing research, that would be good to know.
I share the impression that the agent foundations research agenda seemed not that important. But that point doesn’t feel sufficient to argue that Eliezer’s pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I’m not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI have deprioritized agent foundations research for quite a while now. I also just think it’s extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don’t know how exactly they nowadays think about their prioritization in 2017 and earlier.
I really like Vaniver’s comment further below:
For what it’s worth, my sense is that EY’s track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.
And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it.
I’m very far away from confident that Eliezer’s pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer’s pessimism is misguided – I can’t comment on that. I’m just saying that based on what I’ve read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don’t get the impression that Eliezer’s pessimism is clearly unfounded.
Everyone’s views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn’t important or their strengths aren’t very useful, they wouldn’t do the work and wouldn’t cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they’ve built in the AI industry, or past work they’ve done), too, if it turned out not to be.
Just to give an example of the sorts of observations that make me think Eliezer/”MIRI” could have a point:
I don’t know what happened with a bunch of safety people leaving OpenAI but it’s at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven’t talked to anyone involved.)
I thought it was interesting when Paul noted that our civilization’s Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn’t that exactly the sort of misprediction one shouldn’t be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn’t even at the most optimistic end of people in the alignment community.)
A lot of the work in the arguments for alignment being easy seems to me to be done by dubious analogies that assume that AI alignment is relevantly similar to risky technologies that we’ve already successfully invented. People seem insufficiently quick to get to the actual crux with MIRI, which makes me think they might not be great at passing the Ideological Turing Test. When we get to the actual crux, it’s somewhere deep inside the domain of predicting the training conditions for AGI, which feels like the sort of thing Eliezer might be good at thinking about. Other people might also be good at thinking about this, but then why do they often start their argument with dubious analogies to past technologies that seem to miss the point? [Edit: I may be strawmanning some people here. I have seen direct discussions about the likelihood of treacherous turns vs. repeated early warnings of alignment failure. I didn’t have a strong opinion either way, but it’s totally possible that some people feel like they understand the argument and confidently disagree with Eliezer’s view there.]
But that point doesn’t feel sufficient to argue that Eliezer’s pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I’m not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.)
I get why you take that from my rant, but that’s not really what I meant. I’m more criticizing the “everything is doomed but let’s not give concrete feedback to people” stance, and I think part of it comes from believing for so long (and maybe still believing) that their own approach was the only non-fake one. Also, just calling everyone else a faker is quite disrespectful and doesn’t help.
I also just think it’s extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don’t know how exactly they nowadays think about their prioritization in 2017 and earlier.
MIRI does have some positive points for changing their minds, but also some negative points IMO for taking so long to change their mind. Not sure what the total is.
I’m very far away from confident that Eliezer’s pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer’s pessimism is misguided – I can’t comment on that. I’m just saying that based on what I’ve read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don’t get the impression that Eliezer’s pessimism is clearly unfounded.
Here again, it’s not so much that I disagree with EY about there being problems in the current research proposals. I expect that some of the problems he would point out are ones I see too. I just don’t get the transition from “there are problems with all our current ideas” to “everyone is faking working on alignment and we’re all doomed”.
Everyone’s views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn’t important or their strengths aren’t very useful, they wouldn’t do the work and wouldn’t cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they’ve built in the AI industry, or past work they’ve done), too, if it turned out not to be.
Very good point. That being said, many of the more prosaic alignment people changed their minds multiple times, whereas on these specific questions I feel EY and MIRI didn’t, except when forced by tremendous pressure, which makes me believe that this criticism applies more to them. But that’s one point where having some more knowledge of the internal debates at MIRI could make me change my mind completely.
I don’t know what happened with a bunch of safety people leaving OpenAI but it’s at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven’t talked to anyone involved.)
My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this.
I thought it was interesting when Paul noted that our civilization’s Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn’t that exactly the sort of misprediction one shouldn’t be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn’t even at the most optimistic end of people in the alignment community.)
I’m confused about your question, because what you describe sounds like a misprediction that makes sense? Also, I feel that in this case, there’s a difference between solving the coordination problem of having people implement the solution or not go on a race (which indeed looks harder in the light of Covid management) and solving the technical problem, which is orthogonal to the Covid response.
My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this.
Interesting! This is quite different from the second-hand accounts I heard. (I assume we’re touching different parts of the elephant.)
First, there is a lot of work in the “alignment community” that involves (for example) decision theory or open-source-game-theory or acausal trade, and I haven’t found any of it helpful for what I personally think about (which I’d like to think is “directly attacking the heart of the problem”, but others may judge for themselves when my upcoming post series comes out!).
I guess I see this subset of work as consistent with the hypothesis “some people have been nerd-sniped!”. But it’s also consistent with “some people have reasonable beliefs and I don’t share them, or maybe I haven’t bothered to understand them”. So I’m a bit loath to go around criticizing them, without putting more work into it. But still, this is a semi-endorsement of one of the things you’re saying.
Second, my understanding of MIRI (as an outsider, based purely on my vague recollection of their newsletters etc., and someone can correct me) is that (1) they have a group working on “better understand agent foundations”, and this group contains Abram and Scott, and they publish pretty much everything they’re doing, (2) they have a group working on undisclosed research projects, which are NOT “better understand agent foundations”, (3) they have a couple “none of the above” people including Evan and Vanessa. So I’m confused that you seem to endorse what Abram and Scott are doing, but criticize agent foundations work at MIRI.
Like, maybe people “in the AI alignment community” are being nerd-sniped, and maybe MIRI had a historical role in how that happened, but I’m not sure there’s any actual MIRI employee right now who is doing nerd-sniped-type work, to the best of my limited understanding, unless we want to say Scott is, but you already said Scott is OK in your book.
(By the way, hot takes: I join you in finding some of Abram’s posts to be super helpful, and would throw Stuart Armstrong onto the “super helpful” list too, assuming he counts as “MIRI”. As for Scott: ironically, I find logical induction very useful when thinking about how to build AGI, and somewhat less useful when thinking about how to align it. :-P I didn’t get anything useful for my own thinking out of his Cartesian frames or finite factored sets, but as above, that could just be me; I’m very loath to criticize without doing more work, especially as they’re works in progress, I gather.)
I’m annoyed by EY’s (and maybe MIRI’s?) dismissal of all other alignment work, and by how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment
For what it’s worth, my sense is that EY’s track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.
And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it. If you’re trying to build a bridge to the moon, in fact it’s not going to work, and any determination applied there is going to get wasted. I think I see how a “try to understand things and cut to the heart of them” notices when it’s in situations like that, and I don’t see how “move the ball forward from where it is now” notices when it’s in situations like that.
Agreed on the track record, which is part of why it’s so frustrating that he doesn’t give more details and feedback on why all these approaches are doomed in his view.
That being said, I disagree with the second part, probably because we don’t mean the same thing by “moving the ball”?
In your bridge example, “moving the ball” looks to me like trying to see what problems the current proposal could have, how you could check them, what would be your unknown unknowns. And I definitely expect such an approach to find the problems you mention.
Maybe you could give me a better model of what you mean by “moving the ball”?
Oh, I was imagining something like “well, our current metals aren’t strong enough, what if we developed stronger ones?”, and then focusing on metallurgy. And this is making forward progress—you can build a taller tower out of steel than out of iron—but it’s missing more fundamental issues like “you’re not going to be able to drive on a bridge that’s perpendicular to gravity, and the direction of gravity will change over the course of the trip” or “the moon moves relative to the earth, such that your bridge won’t be able to be one object”, which will sink the project even if you can find a supremely strong metal.
For example, let’s consider Anca Dragan’s research direction that I’m going to summarize as “getting present-day robots to understand what humans around them want and are doing so that they can achieve their goals / cooperate more effectively.” (In mildly adversarial situations like driving, you don’t want to make a cooperatebot, but rather something that follows established norms / prevents ‘cutting’ and so on, but when you have a human-robot team you do care mostly about effective cooperation.)
My guess is this 1) will make the world a better place in the short run under ‘normal’ conditions (most obviously through speeding up adoption of autonomous vehicles and making them more effective) and 2) does not represent meaningful progress towards aligning transformative AI systems. [My model of Eliezer notes that actually he’s making a weaker claim, which is something more like “he’s not surprised by the results of her papers”, which still allows for them to be “progress in the published literature”.]
When I imagine “how do I move the ball forward now?” I find myself drawn towards projects like those, and less to projects like “stare at the nature of cognition until I see a way through the constraints”, which feels like the sort of thing that I would need to do to actually shift my sense of doom.
Adam, can you make a positive case here for how the work being done on prosaic alignment leads to success? You didn’t make one, and without it I don’t understand where you’re coming from. I’m not asking you to tell me a story that you have 100% probability on, just what is the success story you’re acting under, such that EY’s stances seem to you to be mostly distracting people from the real work.
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Thanks for trying to understand my point and asking me for more details. I appreciate it.
Yet I feel weird when trying to answer, because my gut reaction to your comment is that you’re asking the wrong question? Also, the compression of my view to “EY’s stances seem to you to be mostly distracting people from the real work” sounds more lossy than I’m comfortable with. So let me try to clarify and focus on these feelings and impressions, then I’ll answer more about which success stories or directions excite me.
My current problem with EY’s stances is twofold:
First, in posts like this one, he literally writes that everything done under the label of alignment is faking it and not even attacking the problem, except for like 3 people who, even if they’re trying, have it all wrong. I think this is completely wrong, and it’s even more annoying because I find that most people working on alignment try far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.
This is a problem because it doesn’t help anyone working in the field to maybe solve the problems with their approaches that EY sees, which sounds like a massive missed opportunity.
This is also a problem because EY’s opinions are still quite promoted in the community (especially here on the AF and LW), such that newcomers going for what the founder of the field has to say go away with the impression that no one is doing valuable work.
Far more speculative (because I don’t know EY personally), but I expect that kind of judgment to come not so much from a place of all-encompassing genius but instead from generalization after reading some posts/papers. And following this thread I’ve received messages from people who were just as annoyed as I was, and felt their results had been dismissed without even a comment, or classified as trivial when everyone else, including the authors, was quite surprised by them. I’m ready to give EY a bit of “he just sees further than most people”, but not enough that he can discard the whole field from reading a couple of AF posts.
Second, historically, a lot of MIRI’s work has followed a specific epistemic strategy of trying to understand the optimal ways of deciding and thinking, both to predict how an AGI would actually behave and to try to align it. I’m not that convinced by this approach, but even giving it the benefit of the doubt, it has in no way led to accomplishments big enough to justify EY’s (and MIRI’s?) thinly veiled contempt for anyone not doing that. This had, and still has, many bad impacts on the field and on new entrants.
A specific subgroup of people tends to be nerd-sniped by this older MIRI work, because it’s the only part of the field that is more formal, but IMO at the cost of most of what matters about alignment and most of the grounding.
People who don’t have the technical skill to work on MIRI’s older research feel like they have to skill up drastically in maths to be able to do anything relevant in alignment. I literally mentored three people like that, who could actually do a lot of good thinking and cared about alignment, and had to drill it into their heads that they didn’t need super advanced maths skills, except if they wanted to do very, very specific things. I find that particularly sad because IMO the biggest positive contribution to the field by EY and early MIRI comes from their less formal and more philosophical work, which is exactly the kind of work that is stifled by the consequences of this stance.
I also feel people here underestimate how repelling this whole attitude has been for years for most people outside the MIRI bubble. From testimonials by a bunch of more ML-oriented people, and from how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt for experimental work and for doing anything other than decision theory and logic, I expect that this has been one of the big factors in alignment not being taken seriously and people not wanting to work on it.
It’s also important to note that I don’t know if EY and MIRI still think this kind of technical research is highly valuable, the real research, and what should be done, but they have been influential enough that I think a big part of the damage is done, and I read some parts of this post as “If only we could do the real logic thing, but we can’t, so we’re doomed”. There’s also the question of the separation between the image that MIRI and EY project and what they actually think.
Going back to your question, it has a weird double-standard feel. Like, every AF post on more prosaic alignment methods comes with its success story, and reasons for caring about the research. If EY and MIRI want to argue that we’re all doomed, they have the burden of proof to explain why everything that’s been done is terrible and will never lead to alignment. Once again, proving that we won’t be able to solve a problem is incredibly hard and improbable. Funny how everyone here gets that for the “AGI is impossible” question, but apparently it doesn’t apply to “Actually working with AIs and thinking about real AIs will never let you solve alignment in time.”
Still, it’s not too difficult to list a bunch of promising stuff, so here’s a non-exhaustive list:
John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn similar abstractions that humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.
People from EleutherAI working on understanding LMs and GPT-like models as simulators of processes (called simulacra), as well as the safety benefits (corrigibility) and new strategies (leveraging the output distribution in smart ways) that this model allows.
Evan Hubinger’s work on finding predicates that we could check during training to avoid deception and behaviors we’re worried about. He has a full research agenda but it’s not public yet. Maybe our post on myopic decision theory could be relevant.
Stuart Armstrong’s work on model splintering, especially his AI Safety Subprojects, which are experimental, where it’s not obvious what they will find, and which are directly relevant to implementing and using model splintering to solve alignment.
Paul Christiano’s recent work on making question-answerers give useful information instead of what they expect humans to answer, which has a clear success story for these kinds of powerful models and their use in building stronger AIs and supervising training for example.
It’s also important to remember how alignment and the related problems and ideas are still not that well explained, distilled and analyzed for teaching and criticism. So I’m excited too about work that isn’t directly solving alignment but just making things clearer and more explicit, like Evan’s recent post or my epistemic strategies analysis.
Thanks for naming specific work you think is really good! I think it’s pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn’t cruxy (because there’s a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it’s super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.
John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn abstractions similar to the ones humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.
Eliezer liked this post enough that he asked me to signal-boost it in the MIRI Newsletter back in April.
And Paul Christiano and Stuart Armstrong are two of the people Eliezer named as doing very-unusually good work. We continue to pay Stuart to support his research, though he’s mainly supported by FHI.
And Evan works at MIRI, which provides some Bayesian evidence about how much we tend to like his stuff. :)
So maybe there’s not much disagreement here about what’s relatively good? (Or maybe you’re deliberately picking examples you think should be ‘easy sells’ to Steel Eliezer.)
The main disagreement, of course, is about how absolutely promising this kind of stuff is, not how relatively promising it is. This could be some of the best stuff out there, but my understanding of the Adam/Eliezer disagreement is that it’s about ‘how much does this move the dial on actually saving the world?’ / ‘how much would we move the dial if we just kept doing more stuff like this?’.
Actually, this feels to me like a thing that your comments have bounced off of a bit. From my perspective, Eliezer’s statement was mostly saying ‘the field as a whole is failing at our mission of preventing human extinction; I can name a few tiny tidbits of relatively cool things (not just MIRI stuff, but Olah and Christiano), but the important thing is that in absolute terms the whole thing is not getting us to the world where we actually align the first AGI systems’.
My Eliezer-model thinks nothing (including MIRI stuff) has moved the dial much, relative to the size of the challenge. But your comments have mostly been about a sort of status competition between decision theory stuff and ML stuff, between prosaic stuff and ‘gain new insights into intelligence’ stuff, between MIRI stuff and non-MIRI stuff, etc. This feels to me like it’s ignoring the big central point (‘our work so far is wildly insufficient’) in order to haggle over the exact ordering of the wildly-insufficient things.
You’re zeroed in on the “vast desert” part, but the central point wasn’t about the desert-oasis contrast, it was that the whole thing is (on Eliezer’s model) inadequate to the task at hand. Likewise, you’re talking a lot about the “fake” part (and misstating Eliezer’s view as “everyone else [is] a faker”), when the actual claim was about “work that seems to me to be mostly fake or pointless or predictable” (emphasis added).
Maybe to you these feel similar, because they’re all just different put-downs. But… if those were true descriptions of things about the field, they would imply very different things.
I would like to put forward that Eliezer thinks, in good faith, that this is the best hypothesis that fits the data. I absolutely think reasonable people can disagree with Eliezer on this, and I don’t think we need to posit any bad faith or personality failings to explain why people would disagree.
Also, I feel like I want to emphasize that, like… it’s OK to believe that the field you’re working in is in a bad state? The social pressure against saying that kind of thing (or even thinking it to yourself) is part of why a lot of scientific fields are unhealthy, IMO. I’m in favor of you not taking for granted that Eliezer’s right, and pushing back insofar as your models disagree with his. But I want to advocate against:
Saying false things about what the other person is saying. A lot of what you’ve said about Eliezer and MIRI is just obviously false (e.g., we have contempt for “experimental work” and think you can’t make progress by “Actually working with AIs and Thinking about real AIs”).
Shrinking the window of ‘socially acceptable things to say about the field as a whole’ (as opposed to unsolicited harsh put-downs of a particular researcher’s work, where I see more value in being cautious).
I want to advocate ‘smack-talking the field is fine, if that’s your honest view; and pushing back is fine, if you disagree with the view’. I want to see more pushing back on the object level (insofar as people disagree), and less ‘how dare you say that, do you think you’re the king of alignment or something’ or ‘saying that will have bad social consequences’.
I think you’re picking up on a real thing of ‘a lot of people are too deferential to various community leaders, when they should be doing more model-building, asking questions, pushing back where they disagree, etc.’ But I think the solution is to shift more of the conversation to object-level argument (that is, modeling the desired behavior), and make that argument as high-quality as possible.
One thing I want to make clear is that I’m quite aware that my comments have not been as high-quality as they should have been. As I wrote in the disclaimer, I was writing from a place of frustration and annoyance, which also implies a focus on more status-y things. That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.
That being said, much of what I was railing against is a general perception of the situation, from reading a lot of stuff but not necessarily stopping to study all the evidence before writing a fully thought-through opinion. I think this is where the “saying obviously false things” comes from (which I think are pretty easy to believe from just reading this post and a bunch of MIRI write-ups), and why your comments are really important for clarifying the discrepancy between the general mental picture I was drawing from and the actual reality. Also, recentering the discussion on the object-level instead of on status arguments sounds like a good move.
You make a lot of good points and I definitely want to continue the conversation and have more detailed discussion, but I also feel that for the moment I need to take some steps back, read your comments and some of the pointers in other comments, and think a bit more about the question. I don’t think there’s much more to gain from me answering quickly, mostly in reaction.
(I also had the brilliant idea of starting this thread just when I was on the edge of burning out from working too much (during my holidays), so I’m just going to take some time off from work. But I definitely want to continue this conversation further when I come back, although probably not in this thread ^^)
That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.
If you’d just aired out your frustration, framing claims about others in NVC-like ‘I feel like...’ terms (insofar as you suspect you wouldn’t reflectively endorse them), and then a bunch of people messaged you in private to say “thank you! you captured my feelings really well”, then that would seem clearly great to me.
I’m a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So I could imagine that some people who also aren’t independently familiar with all the background facts could come away with a lot of wrong beliefs about the people you’re criticizing.
‘Other people liked my comment, so it was clearly a good thing’ doesn’t distinguish between the worlds where they like it because they share the feelings, vs. agreeing with the factual claims and arguments (and if the latter, whether they’re noticing and filtering out all the seriously false or not-locally-valid parts). If the former, I think it was good. If the latter, I think it was bad.
I’m a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So I could imagine that some people who also aren’t independently familiar with all the background facts could come away with a lot of wrong beliefs about the people you’re criticizing.
That sounds a bit unfair, in the sense that it makes it look like I just invented stuff I didn’t believe and ran with it, when what actually happened was that I wrote about my frustrations but made the mistake of stating them as obvious facts instead of impressions.
Of course, I imagine you feel that my portrayal of EY and MIRI was also unfair, sorry about that.
(I added a note to the three most ranty comments on this thread saying that people should mentally add “I feel like...” to judgments in them.)
I’m confused. When I say ‘that’s just my impression’, I mean something like ‘that’s an inside-view belief that I endorse but haven’t carefully vetted’. (See, e.g., Impression Track Records, referring to Naming Beliefs.)
Example: you said that MIRI has “contempt with experimental work and not doing only decision theory and logic”.
My prior guess would have been that you don’t actually, for-real believe that—that it’s not your ‘impression’ in the above sense, more like ‘unendorsed venting/hyperbole that has a more complicated relation to something you really believe’.
If you do (or did) think that’s actually true, then our models of MIRI are much more different than I thought! Alternatively, if you agree this is not true, then that’s all I meant in the previous comment. (Sorry if I was unclear about that.)
I would say that with slight caveats (make “decision theory and logic” a bit larger to include some more mathy stuff, and make “all experimental work” a bit smaller to not include Redwood’s work), this was indeed my model.
What made me update from our discussion is the realization that I interpreted the dismissal of basically all alignment research as “this has no value whatsoever and people doing it are just pretending to care about alignment”, where it should have been interpreted as something like “this is potentially interesting/new/exciting, but it doesn’t look like it brings us closer to solving alignment in a significant way, hence we’re still failing”.
‘Experimental work is categorically bad, but Redwood’s work doesn’t count’ does not sound like a “slight caveat” to me! What does this generalization mean at all if Redwood’s stuff doesn’t count?
(Neither, for that matter, does the difference between ‘decision theory and logic’ and ‘all mathy stuff MIRI has ever focused on’ seem like a ‘slight caveat’ to me—but in that case maybe it’s because I have a lot more non-logic, non-decision-theory examples in my mind that you might not be familiar with, since it sounds like you haven’t read much MIRI stuff?).
(Responding to entire comment thread) Rob, I don’t think you’re modeling what MIRI looks like from the outside very well.
There’s a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
There was a non-public agenda that involved Haskell programmers. That’s about all I know about it. For all I know they were doing something similar to the modal logic work I’ve seen in the past.
Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with potentially the exception of Chris Olah. Chris’s work is not central to the cluster I would call “experimentalist”.
There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn’t the main point of that paper and was an extremely predictable result.
There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/or doing.
There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of “well, let’s at least die in a slightly more dignified way”.
I don’t particularly agree with Adam’s comments, but it does not surprise me that someone could come to honestly believe the claims within them.
So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I was thinking’) is not great for discussion.
I think it’s true that ‘MIRI is super not into most ML alignment work’, and I think it used to be true that MIRI put almost all of its research effort into HRAD-ish work, and regardless, this all seems like a completely understandable cached impression to have of current-MIRI. If I wrote stuff that makes it sound like I don’t think those views are common, reasonable, etc., then I apologize for that and disavow the thing I said.
But this is orthogonal to what I thought I was talking about, so I’m confused about what seems to me like a topic switch. Maybe the implied background view here is:
‘Adam’s elision between those two claims was a totally normal level of hyperbole/imprecision, like you might find in any LW comment. Picking on word choices like “only decision theory and logic” versus “only research that’s clustered near decision theory and logic in conceptspace”, or “contempt with experimental work” versus “assigning low EV to typical instances of empirical ML alignment work”, is an isolated demand for rigor that wouldn’t make sense as a general policy and isn’t, in any case, the LW norm.’
So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I meant’) is not great for discussion.
It occurs to me that part of the problem may be precisely that Adam et al. don’t think there’s a large difference between these two claims (that actually matters). For example, when I query my (rough, coarse-grained) model of [your typical prosaic alignment optimist], the model in question responds to your statement with something along these lines:
If you remove “mainstream ML alignment work, and nearly all work outside of the HRAD-ish cluster of decision theory, logic, etc.” from “experimental work”, what’s left? Perhaps there are one or two (non-mainstream, barely-pursued) branches of “experimental work” that MIRI endorses and that I’m not aware of—but even if so, that doesn’t seem to me to be sufficient to justify the idea of a large qualitative difference between these two categories.
In a similar vein to the above: perhaps one description is (slightly) hyperbolic and the other isn’t. But I don’t think replacing the hyperbolic version with the non-hyperbolic version would substantially change my assessment of MIRI’s stance; the disagreement feels non-cruxy to me. In light of this, I’m not particularly bothered by either description, and it’s hard for me to understand why you view it as such an important distinction.
Moreover: I don’t think [my model of] the prosaic alignment optimist is being stupid here. I think, to the extent that his words miss an important distinction, it is because that distinction is missing from his very thoughts and framing, not because he happened to choose his words somewhat carelessly when attempting to describe the situation. Insofar as this is true, I expect him to react to your highlighting of this distinction with (mostly) bemusement, confusion, and possibly even some slight suspicion (e.g. that you’re trying to muddy the waters with irrelevant nitpicking).
To be clear: I don’t think you’re attempting to muddy the waters with irrelevant nitpicking here. I think you think the distinction in question is important because it’s pointing to something real, true, and pertinent—but I also think you’re underestimating how non-obvious this is to people who (A) don’t already deeply understand MIRI’s view, and (B) aren’t in the habit of searching for ways someone’s seemingly pointless statement might actually be right.
I don’t consider myself someone who deeply understands MIRI’s view. But I do want to think of myself as someone who, when confronted with a puzzling statement [from someone whose intellectual prowess I generally respect], searches for ways their statement might be right. So, here is my attempt at describing the real crux behind this disagreement:
(with the caveat that, as always, this is my view, not Rob’s, MIRI’s, or anybody else’s)
(and with the additional caveat that, even if my read of the situation turns out to be correct, I think in general the onus is on MIRI to make sure they are understood correctly, rather than on outsiders to try to interpret them—at least, assuming that MIRI wants to make sure they’re understood correctly, which may not always be the best use of researcher time)
I think the disagreement is mostly about MIRI’s counterfactual behavior, not about their actual behavior. I think most observers (including both Adam and Rob) would agree that MIRI leadership has been largely unenthusiastic about a large class of research that currently falls under the umbrella “experimental work”, and that the amount of work in this class MIRI has been unenthused about significantly outweighs the amount of work they have been excited about.
Where I think Adam and Rob diverge is in their respective models of the generator of this observed behavior. I think Adam (and those who agree with him) thinks that the true boundary of the category [stuff MIRI finds unpromising] roughly coincides with the boundary of the category [stuff most researchers would call “experimental work”], such that anything that comes too close to “running ML experiments and seeing what happens” will be met with an immediate dismissal from MIRI. In other words, [my model of] Adam thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would be roughly the same across many possible counterfactual worlds, even if each of those worlds is doing “experiments” investigating substantially different hypotheses.
Conversely, I think Rob thinks the true boundary of the category [stuff MIRI finds unpromising] is mostly unrelated to the boundary of the category [stuff most researchers would call “experimental work”], and that—to the extent MIRI finds most existing “experimental work” unpromising—this is mostly because the existing work is not oriented along directions MIRI finds promising. In other words, [my model of] Rob thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would vary significantly across counterfactual worlds where researchers investigate different hypotheses; in particular, [my model of] Rob thinks MIRI would find most “experimental work” highly promising in the world where the “experiments” being run are those whose results Eliezer/Nate/etc. would consider difficult to predict in advance, and therefore convey useful information regarding the shape of the alignment problem.
I think Rob’s insistence on maintaining the distinction between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” is in fact an attempt to gesture at the underlying distinction outlined above, and I think that his stringency on this matter makes significantly more sense in light of this. (Though, once again, I note that I could be completely mistaken in everything I just wrote.)
Assuming, however, that I’m (mostly) not mistaken, I think there’s an obvious way forward in terms of resolving the disagreement: try to convey the underlying generators of MIRI’s worldview. In other words, do the thing you were going to do anyway, and save the discussions about word choice for afterwards.
I also think I naturally interpreted the terms in Adam’s comment as pointing to specific clusters of work in today’s world, rather than universal claims about all work that could ever be done. That is, when I see “experimental work and not doing only decision theory and logic”, I automatically think of “experimental work” as pointing to a specific cluster of work that exists in today’s world (which we might call mainstream ML alignment), rather than “any information you can get by running code”. Whereas it seems you interpreted it as something closer to “MIRI thinks there isn’t any information to get by running code”.
My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn’t mainstream ML is a good candidate for how you would start to interpret “experimental work” as the latter.) But this seems very plausibly a typical mind fallacy.
EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying “no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers”, and now I realize that is not what you were saying.
This is a good comment! I also agree that it’s mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don’t have a ready-on-hand link that clearly articulates the thing I’m trying to say, then it’s not surprising if others don’t have it in their model.
And based on these comments, I update that there’s probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/etc. If so, sorry about jumping to conclusions, Adam!
Not sure if this helps, and haven’t read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I’d have are that:
* For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
* For almost all experimental results, you would think they were so much less informative as to not be worthwhile.
* There’s a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.
(I’d be willing to take bets on these or pick candidate experiments to clarify this.)
In addition, a consequence of these beliefs is that you think we should be spending way more time sitting around thinking about stuff, and way less time doing experiments, than I do.
I would agree with you that “MIRI hates all experimental work” / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.
I would agree with you that “MIRI hates all experimental work” / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.
Ooh, that’s really interesting. Thinking about it, I think my sense of what’s going on is (and I’d be interested to hear how this differs from your sense):
Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like ‘sufficiently capable and general AI is likely to have property X as a strong default, because approximately-X-ish properties don’t seem particularly difficult to implement (e.g., they aren’t computationally intractable), and we can see from first principles that agents will be systematically less able to get what they want when they lack property X’. My sense is that MIRI puts more weight on arguments like this for reasons like:
We’re more impressed with the track record of inside-view reasoning in science.
I suspect this is partly because the average alignment researcher is impressed with how unusually-poorly inside-view reasoning has done in AI—many have tried to gain a deep understanding of intelligence, and many have failed—whereas (for various reasons) MIRI is less impressed with this, and defaults more to the base rate for other fields, where inside-view reasoning has more extraordinary feats under its belt.
We’re more wary of “modest epistemology”, which we think often acts like a self-fulfilling prophecy. (You don’t practice trying to mechanistically model everything yourself, you despair of overcoming biases, you avoid thinking thoughts that would imply you’re a leader or pioneer because that feels arrogant, so you don’t gain as much skill or feedback in those areas.)
Compared to the average alignment researcher, MIRI tends to put less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed’. This is for a variety of reasons, including:
MIRI is more generally wary of putting much weight on surface generalizations, if we don’t have an inside-view reason to expect the generalization to keep holding.
MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
Relatedly, MIRI thinks AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked. Progress on understanding AGI is much harder to predict than progress on hardware, so we can’t derive as much from trends.
Applying this to experiments:
Some predictions I’d have are that:
* For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
* For almost all experimental results, you would think they were so much less informative as to not be worthwhile.
* There’s a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.
I’d have the same prediction, though I’m less confident that ‘pessimism about experiments’ is doing much work here, vs. ‘pessimism about alignment’. To distinguish the two, I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you (though probably the gap will be smaller?).
I do expect there to be some experiment-specific effect. I don’t know your views well, but if your views are sufficiently like my mental model of ‘typical alignment researcher whose intuitions differ a lot from MIRI’s’, then my guess would be that the disagreement comes down to the above two factors.
1 (more trust in inside view): For many experiments, I’m imagining Eliezer saying ‘I predict the outcome will be X’, and then the outcome is X, and the Modal Alignment Researcher says: ‘OK, but now we’ve validated your intuition—you should be much more confident, and that update means the experiment was still useful.’
To which Hypothetical Eliezer says: ‘I was already more than confident enough. Empirical data is great—I couldn’t have gotten this confident without years of honing my models and intuitions through experience—but now that I’m there, I don’t need to feign modesty and pretend I’m uncertain about everything until I see it with my own eyes.’
2 (less trust in AGI sticking to trends): For many obvious ML experiments Eliezer can’t predict the outcome of, I expect Eliezer to say ‘This experiment isn’t relevant, because factors X, Y, and Z give us strong reason to think that the thing we learn won’t generalize to AGI.’
Which ties back in to 1 as well, because if you don’t think we can build very reliable models in AI without constant empirical feedback, you’ll rarely be confident of abstract reasons X/Y/Z to expect a difference between current ML and AGI, since you can’t go walk up to an AGI today and observe what it’s like.
(You also won’t be confident that X/Y/Z don’t hold—all the possibilities will seem plausible until AGI is actually here, because you generally don’t trust yourself to reason your way to conclusions with much confidence.)
Thanks. For time/brevity, I’ll just say which things I agree / disagree with:
> sufficiently capable and general AI is likely to have property X as a strong default [...]
I generally agree with this, although for certain important values of X (such as “fooling humans for instrumental reasons”) I’m probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I’m also probably more optimistic (but not certain) that those efforts will succeed.
[inside view, modest epistemology]: I don’t have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.
> less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed
> MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
Probably disagree but hard to tell. I think there will both be a lot of similarities and a lot of differences.
> AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked
Seems pretty wrong to me. We probably need both insight and hardware, but the insights themselves are hardware-bottlenecked: once you can easily try lots of stuff and see what happens, insights are much easier, see Crick on x-ray crystallography for historical support (ctrl+f for Crick).
> I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you
I’m more pessimistic than MIRI about HRAD, though that has selection effects. I’ve found conceptual work to be pretty helpful for pointing to where problems might exist, but usually relatively confused about how to address them or how specifically they’re likely to manifest. (Which is to say, overall highly valuable, but consistent with my take above on inside views.)
[experiments are either predictable or uninformative]: Seems wrong to me. As a concrete example: Do larger models have better or worse OOD generalization? I’m not sure if you’d pick “predictable” or “uninformative”, but my take is:
* The outcome wasn’t predictable: within ML there are many people who would have taken each side. (I personally was on the wrong side, i.e. predicting “worse”.)
* It’s informative, for two reasons: (1) It shows that NNs “automatically” generalize more than I might have thought, and (2) Asymptotically, we expect the curve to eventually reverse, so when does that happen and how can we study it?
> Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
Would agree with “most”, but I think you probably meant something like “almost all”, which seems wrong. There’s lots of people working on interpretability, and some of the work seems quite good to me (aside from Chris, I think Noah Goodman, Julius Adebayo, and some others are doing pretty good work).
I’m not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count “the same”) had any clear-to-my-model relevance to alignment, or even AGI. AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we’ve got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem. Bigger models generalizing better or worse doesn’t say anything obvious to any piece of my model of the Big Problem. Though if larger models start generalizing more poorly, then it takes longer to stupidly-brute-scale to AGI, which I suppose affects timelines some, but that just takes timelines from unpredictable to unpredictable sooo.
If we qualify an experiment as interesting when it can tell anyone about anything, then there’s an infinite variety of experiments “interesting” in this sense and I could generate an unlimited number of them. But I do restrict my interest to experiments which can not only tell me something I don’t know, but tell me something relevant that I don’t know. There is also something to be said for opening your eyes and staring around, but even then, there’s an infinity of blank faraway canvases to stare at, and the trick is to go wandering with your eyes wide open someplace you might see something really interesting. Others will be puzzled and interested by different things and I don’t wish them ill on their spiritual journeys, but I don’t expect the vast majority of them to return bearing enlightenment that I’m at all interested in, though now and then Miles Brundage tweets something (from far outside of EA) that does teach me some little thing about cognition.
I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.
Do larger models generalize better or more poorly OOD? It’s a relatively basic question as such things go, and no doubt of interest to many, and may even update our timelines from ‘unpredictable’ to ‘unpredictable’, but… I’m trying to figure out how to say this, and I think I should probably accept that there’s no way to say it that will stop people from trying to sell other bits of research as Super Relevant To Alignment… it’s okay to have an understanding of reality which makes narrower guesses than that about which projects will turn out to be very relevant.
I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.
Trying to rephrase it in my own words (which will necessarily lose some details): are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now? That might tell us, for example, “which aspects of these predictable problems crop up first, and why?”
are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now?
It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like “What actually went into the nonviolence predicate instead of just nonviolence?” Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.
I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
E.g., if you think alignment research is mostly about testing outer reward functions to see what first-order behavior they produce in non-AGI systems, rather than about ‘looking in the learned model’s brain’ to spot mesa-optimization and analyze what that optimization is ultimately ‘trying to do’ (or whatever), then you probably won’t produce stuff that MIRI’s excited about regardless of how experimental vs. theoretical your work is.
(In which case, maybe this is not actually a crux for the usefulness of most alignment experiments, and is instead a crux for the usefulness of most alignment research in general.)
(I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like ‘What’s even the point of aligning AGI? What should humanity do with aligned AGI once it has it?’.)
One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)
I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has… disconcerting effects on my own level of optimism.
Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt at sharing my mental model (so that you can flag in an answer, for readers of this thread, where you think I’m wrong).
Also, I just checked the publication list, and I’ve read or skimmed most things MIRI has published since 2014 (including most newsletters and blog posts on the MIRI website).
My model of MIRI is that initially there were a bunch of people, including EY, working mostly on decision theory stuff, tiling, model theory, the sort of stuff I was pointing at. That predates Nate’s arrival, but in my model it becomes far more legible after that (so circa 2014/2015). In my model, I call that “old school MIRI”, and that was a big chunk of what I was pointing out in my original comment.
Then there are a bunch of thing that seem to have happened:
Newer people (Abram and Scott come to mind, but mostly because they’re the ones who post on the AF and whom I’ve talked to) join this old-school MIRI approach and reshape it into Embedded Agency. Now this new agenda is a bit different from the old-school MIRI work, but I feel like it’s still not that far from decision theory and logic (with maybe a stronger emphasis on the Bayesian part for stuff like logical induction). That might be a part where we’re disagreeing.
A direction related to embedded agency and the decision theory and logic stuff, but focused on implementations through strongly typed programming languages like Haskell and type theory. That’s technically practical, but in my mental model this goes in the same category as “decision theory and logic stuff”, especially because that sort of programming is very close to logic and natural deduction.
MIRI starts its ML-focused agenda, which you already mentioned. The impression I still have is that this didn’t lead to much published work that was actually experimental, instead focusing on recasting questions of alignment through ML theory. But I’ve updated towards thinking MIRI has invested effort into looking at stuff from a more prosaic angle, based on looking more into what has been published there, because some of these ML papers had flown under my radar. (There’s also the difficulty that when I read a paper by someone who now has a position elsewhere — say Ryan Carey or Stuart Armstrong — I don’t think of MIRI but of their current affiliation, even though the work was supported by MIRI (and apparently Stuart is still supported by MIRI).) This is the part of the model where I expect that we might have very different models, because of your knowledge of what was being done internally and never released.
Some new people hired by MIRI fall into what I call the “Bell Labs MIRI” model, where MIRI just hires/funds people who have different approaches from theirs, but who they think are really bright (Evan and Vanessa come to mind, although I don’t know if that’s the thought process that went into hiring them).
That model, plus some feedback and impressions I’ve gathered from people about some MIRI researchers being very doubtful of experimental work, is what led to my “all experimental work is useless” claim. I tried to fit Redwood’s and Chris Olah’s work in there as caveats (which makes for a weird model, but one that makes sense if you have a strong prior that “experimental work is useless for MIRI”).
Our discussion made me think that there are probably far better generators for this general criticism of experimental work, and that they would actually make more sense than “experimental work is useless except this and that”.
From testimonials by a bunch of more ML people and how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt with experimental work and not doing only decision theory and logic
If you were in the situation described by The Rocket Alignment Problem, you could think “working with rockets right now isn’t useful, we need to focus on our conceptual confusions about more basic things” without feeling inherently contemptuous of experimentalism—it’s a tool in the toolbox (which may or may not be appropriate to the task at hand), not a low- or high-status activity on a status hierarchy.
Separately, I think MIRI has always been pretty eager to run experiments in software when they saw an opportunity to test important questions that way. It’s also been 4.5 years now since we announced that we were shifting a lot of resources away from Agent Foundations and into new stuff, and 3 years since we wrote a very long (though still oblique) post about that research, talking about its heavy focus on running software experiments. Though we also made sure to say:
In a sense, you can think of our new research as tackling the same sort of problem that we’ve always been attacking, but from new angles. In other words, if you aren’t excited about logical inductors or functional decision theory, you probably wouldn’t be excited by our new work either.
I don’t think you can say MIRI has “contempt with experimental work” after four years of us mainly focusing on experimental work. There are other disagreements here, but this ties in to a long-standing objection I have to false dichotomies like:
‘we can either do prosaic alignment, or run no experiments’
‘we can either do prosaic alignment, or ignore deep learning’
‘we can either think it’s useful to improve our theoretical understanding of formal agents in toy settings, or think it’s useful to run experiments’
‘we can either think the formal agents work is useful, or think it’s useful to work with state-of-the-art ML systems’
I don’t think Eliezer’s criticism of the field is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’, because (a) he doesn’t think those external-pressures approaches are viable (absent a strong understanding of what’s going on inside the box), and (b) he sees the ‘open the black box’ type work as the critical blocker. (Hence his relative enthusiasm for Chris Olah’s work, which, you’ll notice, is about deep learning and not about decision theory.)
… I find that most people working on alignment are trying far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.
You’ve had a few comments along these lines in this thread, and I think this is where you’re most severely failing to see the situation from Yudkowsky’s point of view.
From Yudkowsky’s view, explaining and justifying MIRI’s work (and the processes he uses to reach such judgements more generally) was the main point of the sequences. He has written more on the topic than anyone else in the world, by a wide margin. He basically spent several years full-time just trying to get everyone up to speed, because the inductive gap was very very wide.
When I put on my Yudkowsky hat and look at both the OP and your comments through that lens… I imagine if I were Yudkowsky I’d feel pretty exasperated at this point. Like, he’s written a massive volume on the topic, and now ten years later a large chunk of people haven’t even bothered to read it. (In particular, I know (because it’s come up in conversation) that at least a few of the people who talk about prosaic alignment a lot haven’t read the sequences, and I suspect that a disproportionate number haven’t. I don’t mean to point fingers or cast blame here, the sequences are a lot of material and most of it is not legibly relevant before reading it all, but if you haven’t read the sequences and you’re wondering why MIRI doesn’t have a write-up on why they’re not excited about prosaic alignment… well, that’s kinda the write-up. Also I feel like I need a disclaimer here that many people excited about prosaic alignment have read the sequences, I definitely don’t mean to imply that this is everyone in the category.)
(To be clear, I don’t think the sequences explain all of the pieces behind Yudkowsky’s views of prosaic alignment, in depth. They were written for a different use-case. But I do think they explain a lot.)
Related: IMO the best roughly-up-to-date piece explaining the Yudkowsky/MIRI viewpoint is The Rocket Alignment Problem.
You’ve had a few comments along these lines in this thread, and I think this is where you’re most severely failing to see the situation from Yudkowsky’s point of view.
From Yudkowsky’s view, explaining and justifying MIRI’s work (and the processes he uses to reach such judgements more generally) was the main point of the sequences. He has written more on the topic than anyone else in the world, by a wide margin. He basically spent several years full-time just trying to get everyone up to speed, because the inductive gap was very very wide.
My memory of the sequences is that they’re far more about defending and explaining the alignment problem than criticizing prosaic AGI (maybe because the term couldn’t have been used years before Paul coined it?). Could you give me the best pointers to criticism of prosaic alignment in the sequences? (I’ve read the sequences, but I don’t remember every single post, and my impression from memory is what I’ve written above.)
I also feel that there might be a discrepancy between who I think of when I think of prosaic alignment researchers and what the category means in general/to most people here. My category mostly includes AF posters, people from a bunch of places like EleutherAI/OpenAI/DeepMind/Anthropic/Redwood, and people from CHAI and FHI. I expect most of these people to actually have read the sequences and tried to understand MIRI’s perspective. Maybe someone could point out a list of other places where prosaic alignment research is being done that I’m missing, especially places where people probably haven’t read the sequences? Or maybe I’m overestimating how many of the people in the places I mentioned have read the sequences?
I don’t mean to say that there’s critique of prosaic alignment specifically in the sequences. Rather, a lot of the generators of the Yudkowsky-esque worldview are in there. (That is how the sequences work: it’s not about arguing specific ideas around alignment, it’s about explaining enough of the background frames and generators that the argument becomes unnecessary. “Raise the sanity waterline” and all that.)
For instance, just the other day I ran across this:
Of this I learn the lesson: You cannot manipulate confusion. You cannot make clever plans to work around the holes in your understanding. You can’t even make “best guesses” about things which fundamentally confuse you, and relate them to other confusing things. Well, you can, but you won’t get it right, until your confusion dissolves. Confusion exists in the mind, not in the reality, and trying to treat it like something you can pick up and move around, will only result in unintentional comedy.
Similarly, you cannot come up with clever reasons why the gaps in your model don’t matter. You cannot draw a border around the mystery, put on neat handles that let you use the Mysterious Thing without really understanding it—like my attempt to make the possibility that life is meaningless cancel out of an expected utility formula. You can’t pick up the gap and manipulate it.
If the blank spot on your map conceals a land mine, then putting your weight down on that spot will be fatal, no matter how good your excuse for not knowing. Any black box could contain a trap, and there’s no way to know except opening up the black box and looking inside. If you come up with some righteous justification for why you need to rush on ahead with the best understanding you have—the trap goes off.
(The earlier part of the post had a couple embarrassing stories of mistakes Yudkowsky made earlier, which is where the lesson came from.) Reading that, I was like, “man that sure does sound like the Yudkowsky-esque viewpoint on prosaic alignment”.
Or maybe I’m overestimating how many of the people in the places I mentioned have read the sequences?
I think you are overestimating. At the orgs you list, I’d guess at least 25% and probably more than half have not read the sequences. (Low confidence/wide error bars, though.)
Thank you for the links, Adam. To clarify, the kind of argument I’m really looking for is something like the following three (hypothetical) examples.
Mesa-optimization is the primary threat model of unaligned AGI systems. Over the next few decades there will be a lot of companies building ML systems that create mesa-optimizers. I think it is within 5 years of current progress that we will understand how ML systems create mesa-optimizers and how to stop it. Therefore I think the current field is adequate for the problem (80%).
When I look at the research we’re outputting, it seems to me that we are producing research at a speed and flexibility faster than any comparably sized academic department globally, or the ML industry, and so I am much more hopeful that we’re able to solve our difficult problem before the industry builds an unaligned AGI. I give it a 25% probability, which I suspect is much higher than Eliezer’s.
I basically agree the alignment problem is hard and unlikely to be solved, but I don’t think we have any alternative than the current sorts of work being done, which is a combo of (a) agent foundations work (b) designing theoretical training algorithms (like Paul is) or (c) directly aligning narrowly super intelligent models. I am pretty open to Eliezer’s claim that we will fail but I see no alternative plan to pour resources into.
Whatever you actually think about the field and how it will save the world, say it!
It seems to me that almost all of the arguments you’ve made work whether the field is a failure or not. The debate here has to pass through whether the field is on-track or not, and we must not sidestep that conversation.
I want to leave this paragraph as social acknowledgment that you mentioned upthread that you’re tired and taking a break, and I want to give you a bunch of social space to not return to this thread for however long you need to take! Slow comments are often the best.
I’m glad that I posted my inflammatory comment, if only because exchanging with you and Rob made me actually consider the question of “what is our story to success”, instead of just “are we making progress/creating valuable knowledge”. And the way you two have been casting it is way less aversive to me than the way EY tends to frame it. This is definitely something I want to think more about. :)
I want to leave this paragraph as social acknowledgment that you mentioned upthread that you’re tired and taking a break, and I want to give you a bunch of social space to not return to this thread for however long you need to take! Slow comments are often the best.
I have sympathy for the “this feels somewhat contemptuous” reading, but I want to push back a bit on the “EY contemptuously calling nearly everyone fakers” angle, because I think “[thinly] veiled contempt” is an uncharitable reading. He could be simply exasperated about the state of affairs, or wishing people would change their research directions but respect them as altruists for Trying At All, or who knows what? I’d rather not overwrite his intentions with our reactions (although it is mostly the job of the writer to ensure their writing communicates the right information [although part of the point of the website discussion was to speak frankly and bluntly]).
If superintelligence is approximately multimodal GPT-17 plus reinforcement learning, then understanding how GPT-3-scale algorithms function is exceptionally important to understanding super-intelligence.
Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.
Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.
Why do you think this? On the original definition of prosaic alignment, I don’t see why this would be true.
(In case it clarifies anything: my understanding of Paul’s research program is that it’s all about trying to achieve prosaic alignment for superintelligence. ‘Prosaic’ was never meant to imply ‘dumb’, because Paul thinks current techniques will eventually scale to very high capability levels.)
My thinking is that prosaic alignment can also apply to non-super intelligent systems. If multimodal GPT-17 + RL = superintelligence, then whatever techniques are involved with aligning that system would probably apply to multimodal GPT-3 + RL, despite not being superintelligence. Superintelligence is not a prerequisite for being alignable.
This is already reflected in the upvotes, but just to say it explicitly: I think the replies to this comment from Rob and dxu in particular have been exceptionally charitable and productive; kudos to them. This seems like a very good case study in responding to a provocative framing with a concentration of positive discussion norms that leads to productive engagement.
if EY and other MIRI people who are very dubious of most alignment research could give more feedback on that and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work, it’s with the disagreement stopping at “that’s not going to work” and there being no dialogue or back-and-forth.
Nerd sniping is a slang term that describes a particularly interesting problem that is presented to a nerd, often a physicist, tech geek or mathematician. The nerd stops all activity to devote attention to solving the problem, often at his or her own peril.
My original exposure to LW drove me away in large part because of the issues you describe. I would also add that (at least circa 2010) you needed to have a near-deistic belief in the anti-messianic emergence of some AGI so powerful that it can barely be described in terms of human notions of “intelligence.”
Note that this also has massive downsides for conceptual alignment in general, because when bringing people in, you have to deal with this specter of nerd-sniping by the founding lab of the field, which is still a figure of authority. I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.
When I’m not frustrated by this situation, I’m just sad. Some of the brightest and earliest thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t so much switch to something else as declare that everyone else was still full of it, and that because they had no idea how to solve the problem at the moment, it was doomed.
What could be done differently? Well, I would really, really like it if EY and other MIRI people who are very dubious of most alignment research could give more feedback on that work and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work, it’s with the disagreement stopping at “that’s not going to work” without any dialogue or back and forth.
Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then being like “fuck it, let’s just find a new way of grappling with it”. That’s really what saddens me the most about all of this: I feel that some of the best current minds who care about alignment have sort of given up on actually trying to solve it.
This is an apology for the tone and the framing of the above comment (and my following answers), which have both been needlessly aggressive, status-focused and uncharitable. Underneath are still issues that matter a lot to me, but others have discussed them better (I’ll provide a list of linked comments at the end of this one).
Thanks to Richard Ngo for convincing me that I actually needed to write such an apology, which was probably the needed push for me to stop weaseling around it.
So what did I do wrong? The list is pretty damning:
I took something about the original post that I didn’t understand — EY’s “And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.” — and because it didn’t make sense to me, and because that fitted with my stereotypes of MIRI and EY’s dismissiveness of a lot of work in alignment, I turned it into an explanation of this as an attack on alignment researchers, saying they were consciously faking it when they knew they should do better. Whereas I now feel that what EY meant is far closer to “alignment research at the moment is trying to try to align AI as best as we can, instead of just trying to do it”. I’m still not sure if I agree with that characterization, but that sounds far more like something that can be discussed.
There’s also a weird aspect of status-criticism to my comment that I think I completely failed to explain. Looking at my motives now (let’s be wary of hindsight...), I feel like my issue with the status thing was more that a bunch of people other than EY and MIRI just take what they say as super strong evidence without looking at all the arguments and details, and thus I expected this post and recent MIRI publications to create a background of “we’re doomed” for a lot of casual observers, backed by the force of EY and MIRI’s status.
But I don’t want to say that EY and MIRI are given too much status in general in the community, even if I actually wrote something along those lines. I guess it’s just easier to focus your criticism on the beacon of status than on the invisible crowd misusing status. Sorry about that.
I somehow turned that into an attack on MIRI’s research (at least a chunk of it), which didn’t really have anything to do with it. That probably was just a manifestation of my frustration when people come to the field and feel like they shouldn’t do the experimental research that they feel better suited for, or feel like they need to learn a lot of advanced maths. Even if those are not official MIRI positions, I definitely feel MIRI has had a big influence on them. And yet, maybe newcomers should question themselves that way. It always sounded like a loss of potential to me, because the outcome is often to not do alignment at all; but maybe even if you’re into experiments, the best way you could align AIs now doesn’t go through that path (and you could still find that exciting enough to pursue new research).
Whatever the correct answer is, my weird ad-hominem attack has nothing to do with it, so I apologize for attacking all of MIRI’s research and their choice of research agendas with it (even if I think talking more about what is and was the right choice still matters).
Part of my failure here has also been to not account for the fact that aggressive writing just feels snappier without much effort. I still think my paragraph starting with “When I’m not frustrated by this situation, I’m just sad.” works pretty well as an independent piece of writing, but it’s obviously needlessly aggressive and spicy, and doesn’t leave any room for the doubts that I actually felt or should have felt. My answers after that comment are better, but still ride too much on that tone.
One of the saddest failures (pointed out to me by Richard) is that by my tone and my presentation, I made it harder and more aversive for MIRI and EY to share their models, because they have to fear a bit more that kind of reaction. And even if Rob reacted really nicely, I expect that it required a bunch of additional mental energy that a better comment wouldn’t have demanded.
So I apologize for that, and really want more model-building and discussions from MIRI and EY publicly.
So in summary, my comment should have been something along the lines of “Hey, I don’t understand what your generators are for saying that all alignment research is ‘mostly fake or pointless or predictable’; could you give me some pointers to that?”. I wasn’t in the headspace, and didn’t have the right handles, to frame it that way and not go into weirdly aggressive tangents, and that’s on me.
On the plus side, every other comment on the thread has been high-quality and thoughtful, so here’s a list of the best ones IMO:
Ben Pace’s comment on what success stories for alignment would look like, giving examples.
Rob Bensinger’s comment about the directions of prosaic alignment I wrote I was excited about, and whether they’re “moving the dial”.
Rohin Shah’s comment which frames the outside view of MIRI I was pointing out better than I did and not aggressively.
John Wentworth’s two comments about the generators of EY’s pessimism being in the sequences all along.
Vaniver’s comment presenting an analysis of why some concrete ML work in alignment doesn’t seem to help for the AGI level.
Rob Bensinger’s comment drawing a great list of distinctions to clarify the debate.
Thank you for this follow-up comment Adam, I appreciate it.
I’m… confused by this framing? Specifically, this bit (as well as other bits like these)
seem to be coming at the problem with [something like] a baked-in assumption that prosaic alignment is something that Actually Has A Chance Of Working?
And, like, to be clear, obviously if you’re working on prosaic alignment that’s going to be something you believe[1]. But it seems clear to me that EY/MIRI does not share this viewpoint, and all the disagreements you have regarding their treatment of other avenues of research seem to me to be logically downstream of this disagreement?
I mean, it’s possible I’m misinterpreting you here. But you’re saying things that (from my perspective) only make sense with the background assumption that “there’s more than one game in town”—things like “I wish EY/MIRI would spend more time engaging with other frames” and “I don’t like how they treat lack of progress in their frame as evidence that all other frames are similarly doomed”—and I feel like all of those arguments simply fail in the world where prosaic alignment is Actually Just Doomed, all the other frames Actually Just Go Nowhere, and conceptual alignment work of the MIRI variety is (more or less) The Only Game In Town.
To be clear: I’m pretty sure you don’t believe we live in that world. But I don’t think you can just export arguments from the world you think we live in to the world EY/MIRI thinks we live in; there needs to be a bridging step first, where you argue about which world we actually live in. I don’t think it makes sense to try and highlight the drawbacks of someone’s approach when they don’t share the same background premises as you, and the background premises they do hold imply a substantially different set of priorities and concerns.
Another thing it occurs to me your frustration could be about is the fact that you can’t actually argue this with EY/MIRI directly, because they don’t frequently make themselves available to discuss things. And if something like that’s the case, then I guess what I want to say is… I sympathize with you abstractly, but I think your efforts are misdirected? It’s okay for you and other alignment researchers to have different background premises from MIRI or even each other, and for you and those other researchers to be working on largely separate agendas as a result? I want to say that’s kind of what foundational research work looks like, in a field where (to a first approximation) nobody has any idea what the fuck they’re doing?
And yes, in the end [assuming somebody succeeds] that will likely mean that a bunch of people’s research directions were ultimately irrelevant. Most people, even. That’s… kind of unavoidable? And also not really the point, because you can’t know which line of research will be successful in advance, so all you have to go on is your best guess, which… may or may not be the same as somebody else’s best guess?
I dunno. I’m trying not to come across as too aggressive here, which is why I’m hedging so many of my claims. To some extent I feel uncomfortable trying to “police” people’s thoughts here, since I’m not actually an alignment researcher… but also it felt to me like your comment was trying to police people’s thoughts, and I don’t actually approve of that either, so...
Yeah. Take this how you will.
[1] I personally am (relatively) agnostic on this question, but as a non-expert in the field my opinion should matter relatively little; I mention this merely as a disclaimer that I am not necessarily on board with EY/MIRI about the doomed-ness of prosaic alignment.
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Okay, so you’re completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is Impossible or not. But I feel like you don’t get how one-sided this debate is, and how misrepresented it is here (and generally on the AF).
Like, nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher that I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with Neural Nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.
But that’s just a majority argument. The real problem is that nobody has ever given a good argument for why this is impossible. I mean, the analogous situation is that a car is driving right at you, accelerating, and you’ve decided somehow that it’s impossible to ever stop it before it kills you. You need a very strong case before giving up like that. And that has not been given by EY and MIRI AFAIK.
The last part of this is that because EY and MIRI founded the field, their view is given far more credibility than what it would have on the basis of the arguments alone, and far more than it has in actual discussions between researchers.
The best analogy I can find (a bit strawmanish but less than you would expect) is a world where somehow the people who had founded the study of cancer had the idea that no method based on biological experimentation and thinking about cells could ever cure cancer, and that the only way of solving it was to understand every dynamics in a very advanced category theoretic model. Then having found the latter really hard, they just say that curing cancer is impossible.
I think one core issue here is that there are actually two debates going on. One is “how hard is the alignment problem?”; another is “how powerful are prosaic alignment techniques?” Broadly speaking, I’d characterise most of the disagreement as being on the first question. But you’re treating it like it’s mostly on the second question—like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.
My attempt to portray EY’s perspective is more like: he’s concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he’s trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.
Regardless of that, calling the debate “one sided” seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem “one sided” − 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily. But for fairly similar meta-level reasons as why it’s good for them to listen to us in an open-minded way, it’s also good for prosaic alignment researchers to listen to EY in an open-minded way. (As a side note, I’d be curious what credence you place on EY’s worldview being more true than the prosaic alignment worldview.)
Now, your complaint might be that MIRI has not made their case enough over the last few years. If that’s the main issue, then stay tuned; as Rob said, this is just the preface to a bunch of relevant material.
The 2016 survey of people in AI asked people about the alignment problem as described by Stuart Russell, and 39% said it was an important problem and 33% that it’s a harder problem than most other problems in the field.
Thanks for the detailed comment!
That’s an interesting separation of the problem, because I really feel there is more disagreement on the second question than on the first.
Funnily, aren’t the people currently working on ageing using quite prosaic techniques? I completely agree that one needs to go for the big problems, especially ones that only appear in more powerful regimes (which is why I am adamant that there should be places for researchers to think about distinctly AGI problems and not have to rephrase everything in a way that is palatable to ML academia). But people like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies. So I have trouble understanding how prosaic alignment isn’t trying to solve the problem at all. Maybe it’s just a disagreement on how large the “prosaic alignment” category is?
You definitely have a point, and I want to listen to EY in an open-minded way. It’s just harder when he writes things like everyone working on alignment is faking it, without giving much detail. Also, I feel that your comparison breaks a bit because, compared to the debate with ML researchers (where most people against alignment haven’t even thought about the basics and make obvious mistakes), the other parties in this debate have thought long and hard about alignment. Maybe not as much as EY, but clearly much more than the ML researchers in the whole “is alignment even a problem” debate.
At the moment I feel like I don’t have a good enough model of EY’s worldview, plus I’m annoyed by his statements, so any credence I give now would be biased against his worldview.
Yeah, excited about that!
What is this feeling based on? One way we could measure this is by asking people about how much AI xrisk there is conditional on there being no more research explicitly aimed at aligning AGIs. I expect that different people would give very different predictions.
Everyone agrees that Paul is trying to solve foundational problems. And it seems strange to criticise Eliezer’s position by citing the work of MIRI employees.
As Rob pointed out above, this straightforwardly mischaracterises what Eliezer said.
I worry that “Prosaic Alignment Is Doomed” seems a bit… off as the most appropriate crux. At least for me. It seems hard for someone to justifiably know that this is true with enough confidence to not even try anymore. To have essayed or otherwise precluded all promising paths of inquiry, to not even engage with the rest of the field, to not even try to argue other researchers out of their mistaken beliefs, because it’s all Hopeless.
Consider the following analogy: Someone who wants to gain muscle, but has thought a lot about nutrition and their genetic makeup and concluded that Direct Exercise Gains Are Doomed, and they should expend their energy elsewhere.
OK, maybe. But how about try going to the gym for a month anyways and see what happens?
The point isn’t “EY hasn’t spent a month of work thinking about prosaic alignment.” The point is that AFAICT, by MIRI/EY’s own values, valuable-seeming plans are being left to rot on the cutting room floor. Like, “core MIRI staff meet for an hour each month and attack corrigibility/deceptive cognition/etc with all they’ve got. They pay someone to transcribe the session and post the fruits / negative results / reasoning to AF, without individually committing to following up with comments.”
(I am excited by Rob Bensinger’s comment that this post is the start of more communication from MIRI)
I don’t get the impression that Eliezer’s saying that alignment of prosaic AI is impossible. I think he’s saying “it’s almost certainly not going to happen because humans are bad at things.” That seems compatible with “every other researcher that I know think Prosaic Alignment is possible, even if potentially very hard” (if you go with the “very hard” part).
Yes, +1 to this; I think it’s important to distinguish between impossible (which is a term I carefully avoided using in my earlier comment, precisely because of its theoretical implications) and doomed (which I think of as a conjunction of theoretical considerations—how hard is this problem?—and social/coordination ones—how likely is it that humans will have solved this problem before solving AGI?).
I currently view this as consistent with e.g. Eliezer’s claim that Chris Olah’s work, though potentially on a pathway to something important, is probably going to accomplish “far too little far too late”. I certainly didn’t read it as anything like an unconditional endorsement of Chris’ work, as e.g. this comment seems to imply.
Ditto—the first half makes it clear that any strategy which isn’t at most 2 years slower than an unaligned approach will be useless, and that prosaic AI safety falls into that bucket.
Thanks for elaborating. I don’t think I have the necessary familiarity with the alignment research community to assess your characterization of the situation, but I appreciate your willingness to raise potentially unpopular hypotheses to attention. +1
Thanks for taking the time of asking a question about the discussion even if you lack expertise on the topic. ;)
+1 for this whole conversation, including Adam pushing back re prosaic alignment / trying to articulate disagreements! I agree that this is an important thing to talk about more.
I like the ‘give more concrete feedback on specific research directions’ idea, especially if it helps clarify generators for Eliezer’s pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you’re simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.
From my perspective, the OP discussion is the opening salvo in ‘MIRI does a lot more model-sharing and discussion’. It’s more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we’re thinking about AGI, etc. In the meantime, I’m strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.
Someone privately contacted me to express confusion, because they thought my ‘+1’ means that I think adamShimi’s initial comment was unusually great. That’s not the case. The reasons I commented positively are:
I think this overall exchange went well—it raised good points that might have otherwise been neglected, and everyone quickly reached agreement about the real crux.
I want to try to cancel out any impression that criticizing / pushing back on Eliezer-stuff is unwelcome, since Adam expressed worries about a “taboo on criticizing MIRI and EY too hard”.
On a more abstract level, I like seeing people ‘blurt out what they’re actually thinking’ (if done with enough restraint and willingness-to-update to mostly avoid demon threads), even if I disagree with the content of their thought. I think disagreements are often tied up in emotions, or pattern-recognition, or intuitive senses of ‘what a person/group/forum is like’. This can make it harder to epistemically converge about tough topics, because there’s a temptation to pretend your cruxes are more simple and legible than they really are, and end up talking about non-cruxy things.
Separately, I endorse Ben Pace’s question (“Can you make a positive case here for how the work being done on prosaic alignment leads to success?”) as the thing to focus on.
Thanks for the kind answer, even if we’re probably disagreeing about most points in this thread. I think message like yours really help in making everyone aware that such topics can actually be discussed publicly without big backlash.
That sounds amazing! I definitely want to extract some of the epistemic strategies that EY uses to generate criticisms and break proposals. :)
Excited about that!
I don’t think the “Only Game in Town” argument works when EY in the OP says
As well as approving of Redwood Research’s work.
Some things that seem important to distinguish here:
‘Prosaic alignment is doomed’. I parse this as: ‘Aligning AGI, without coming up with any fundamentally new ideas about AGI/intelligence or discovering any big “unknown unknowns” about AGI/intelligence, is doomed.’
I (and my Eliezer-model) endorse this, in large part because ML (as practiced today) produces such opaque and uninterpretable models. My sense is that Eliezer’s hopes largely route through understanding AGI systems’ internals better, rather than coming up with cleverer ways to apply external pressures to a black box.
‘All alignment work that involves running experiments on deep nets is doomed’.
My Eliezer-model doesn’t endorse this at all.
Also important to distinguish, IMO (making up the names here):
A strong ‘prosaic AGI’ thesis, like ‘AGI will just be GPT-n or some other scaled-up version of current systems’. Eliezer is extremely skeptical of this.
A weak ‘prosaic AGI’ thesis, like ‘AGI will involve coming up with new techniques, but the path between here and AGI won’t involve any fundamental paradigm shifts and won’t involve us learning any new deep things about intelligence’. I’m not sure what Eliezer’s unconditional view on this is, but I’d guess that he thinks this falls a lot in probability if we condition on something like ‘good outcomes are possible’—it’s very bad news.
An ‘unprosaic but not radically different AGI’ thesis, like ‘AGI might involve new paradigm shifts and/or new deep insights into intelligence, but it will still be similar enough to the current deep learning paradigm that we can potentially learn important stuff about alignable AGI by working with deep nets today’. I don’t think Eliezer has a strong view on this, though I observe that he thinks some of the most useful stuff humanity can do today is ‘run various alignment experiments on deep nets’.
An ‘AGI won’t be GOFAI’ thesis. Eliezer strongly endorses this.
There’s also an ‘inevitability thesis’ that I think is a crux here: my Eliezer-model thinks there are a wide variety of ways to build AGI that are very different, such that it matters a lot which option we steer toward (and various kinds of ‘prosaicness’ might be one parameter we can intervene on, rather than being a constant). My Paul-model has the opposite view, and endorses some version of inevitability.
Note: GOFAI = Good Old Fashioned AI
Your comment and Vaniver’s (paraphrasing) “not surprised by the results of this work, so why do it?” were especially helpful. EY (or others) assessing concrete research directions with detailed explanations would be even more helpful.
I agree with Rohin’s general question of “Can you tell a story where your research helps solve a specific alignment problem?”, and if you have other heuristics when assessing research, that would be good to know.
+1, plus endorsing Chris Olah
I share the impression that the agent foundations research agenda seemed not that important. But that point doesn’t feel sufficient to argue that Eliezer’s pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I’m not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI have deprioritized agent foundations research for quite a while now. I also just think it’s extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don’t know exactly how they nowadays think about their prioritization in 2017 and earlier.
I really like Vaniver’s comment further below:
I’m very far away from confident that Eliezer’s pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer’s pessimism is misguided – I can’t comment on that. I’m just saying that based on what I’ve read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don’t get the impression that Eliezer’s pessimism is clearly unfounded.
Everyone’s views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn’t important or their strengths aren’t very useful, they wouldn’t do the work and wouldn’t cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they’ve built in the AI industry, or past work they’ve done), too, if it turned out not to be.
Just to give some examples of the sorts of observations that make me think Eliezer/”MIRI” could have a point:
I don’t know what happened with a bunch of safety people leaving OpenAI but it’s at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven’t talked to anyone involved.)
I thought it was interesting when Paul noted that our civilization’s Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn’t that exactly the sort of misprediction one shouldn’t be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn’t even at the most optimistic end of people in the alignment community.)
A lot of the work in the arguments for alignment being easy seems to me to be done by dubious analogies that assume that AI alignment is relevantly similar to risky technologies that we’ve already successfully invented. People seem insufficiently quick to get to the actual crux with MIRI, which makes me think they might not be great at passing the Ideological Turing Test. When we get to the actual crux, it’s somewhere deep inside the domain of predicting the training conditions for AGI, which feels like the sort of thing Eliezer might be good at thinking about. Other people might also be good at thinking about this, but then why do they often start their argument with dubious analogies to past technologies that seem to miss the point?
[Edit: I may be strawmanning some people here. I have seen direct discussions about the likelihood of treacherous turns vs. repeated early warnings of alignment failure. I didn’t have a strong opinion either way, but it’s totally possible that some people feel like they understand the argument and confidently disagree with Eliezer’s view there.]
That’s an awesome comment, thanks!
I get why you took that from my rant, but that’s not really what I meant. I’m more criticizing the “everything is doomed but let’s not give concrete feedback to people” stance, and I think part of it comes from believing for so long (and maybe still believing) that their own approach was the only non-fake one. Also, just calling everyone else a faker is quite disrespectful and not helpful.
MIRI does have some positive points for changing their minds, but also some negative points IMO for taking so long to change their mind. Not sure what the total is.
Here again, it’s not so much that I disagree with EY about there being problems in the current research proposals. I expect that some of the problems he would point out are ones I see too. I just don’t get the transition from “there are problems with all our current ideas” to “everyone is faking working on alignment and we’re all doomed”.
Very good point. That being said, many of the more prosaic alignment people changed their minds multiple times, whereas on these specific questions I feel EY and MIRI didn’t, except when forced by tremendous pressure, which makes me believe that this criticism applies more to them. But that’s one point where having some more knowledge of the internal debates at MIRI could make me change my mind completely.
My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this.
I’m confused about your question, because what you describe sounds like a misprediction that makes sense? Also, I feel that in this case there’s a difference between solving the coordination problem of getting people to implement the solution or not go into a race (which indeed looks harder in light of Covid management) and solving the technical problem, which is orthogonal to the Covid response.
Interesting! This is quite different from the second-hand accounts I heard. (I assume we’re touching different parts of the elephant.)
Couple things:
First, there is a lot of work in the “alignment community” that involves (for example) decision theory or open-source-game-theory or acausal trade, and I haven’t found any of it helpful for what I personally think about (which I’d like to think is “directly attacking the heart of the problem”, but others may judge for themselves when my upcoming post series comes out!).
I guess I see this subset of work as consistent with the hypothesis “some people have been nerd-sniped!”. But it’s also consistent with “some people have reasonable beliefs and I don’t share them, or maybe I haven’t bothered to understand them”. So I’m a bit loath to go around criticizing them, without putting more work into it. But still, this is a semi-endorsement of one of the things you’re saying.
Second, my understanding of MIRI (as an outsider, based purely on my vague recollection of their newsletters etc., and someone can correct me) is that (1) they have a group working on “better understand agent foundations”, and this group contains Abram and Scott, and they publish pretty much everything they’re doing, (2) they have a group working on undisclosed research projects, which are NOT “better understand agent foundations”, (3) they have a couple “none of the above” people including Evan and Vanessa. So I’m confused that you seem to endorse what Abram and Scott are doing, but criticize agent foundations work at MIRI.
Like, maybe people “in the AI alignment community” are being nerd-sniped, and maybe MIRI had a historical role in how that happened, but I’m not sure there’s any actual MIRI employee right now who is doing nerd-sniped-type work, to the best of my limited understanding, unless we want to say Scott is, but you already said Scott is OK in your book.
(By the way, hot takes: I join you in finding some of Abram’s posts to be super helpful, and would throw Stuart Armstrong onto the “super helpful” list too, assuming he counts as “MIRI”. As for Scott: ironically, I find logical induction very useful when thinking about how to build AGI, and somewhat less useful when thinking about how to align it. :-P I didn’t get anything useful for my own thinking out of his Cartesian frames or finite factored sets, but as above, that could just be me; I’m very loath to criticize without doing more work, especially as they’re works in progress, I gather.)
For what it’s worth, my sense is that EY’s track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.
And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it. If you’re trying to build a bridge to the moon, in fact it’s not going to work, and any determination applied there is going to get wasted. I think I see how a “try to understand things and cut to the heart of them” approach notices when it’s in situations like that, and I don’t see how “move the ball forward from where it is now” notices when it’s in situations like that.
Agreed on the track record, which is part of why it’s so frustrating that he doesn’t give more details and feedback on why all these approaches are doomed in his view.
That being said, I disagree for the second part, probably because we don’t mean the same thing by “moving the ball”?
In your bridge example, “moving the ball” looks to me like trying to see what problems the current proposal could have, how you could check them, what would be your unknown unknowns. And I definitely expect such an approach to find the problems you mention.
Maybe you could give me a better model of what you mean by “moving the ball”?
Oh, I was imagining something like “well, our current metals aren’t strong enough, what if we developed stronger ones?”, and then focusing on metallurgy. And this is making forward progress—you can build a taller tower out of steel than out of iron—but it’s missing more fundamental issues like “you’re not going to be able to drive on a bridge that’s perpendicular to gravity, and the direction of gravity will change over the course of the trip” or “the moon moves relative to the earth, such that your bridge won’t be able to be one object”, which will sink the project even if you can find a supremely strong metal.
For example, let’s consider Anca Dragan’s research direction that I’m going to summarize as “getting present-day robots to understand what humans around them want and are doing so that they can achieve their goals / cooperate more effectively.” (In mildly adversarial situations like driving, you don’t want to make a cooperatebot, but rather something that follows established norms / prevents ‘cutting’ and so on, but when you have a human-robot team you do care mostly about effective cooperation.)
My guess is this 1) will make the world a better place in the short run under ‘normal’ conditions (most obviously through speeding up adoption of autonomous vehicles and making them more effective) and 2) does not represent meaningful progress towards aligning transformative AI systems. [My model of Eliezer notes that actually he’s making a weaker claim, which is something more like “he’s not surprised by the results of her papers”, which still allows for them to be “progress in the published literature”.]
When I imagine “how do I move the ball forward now?” I find myself drawn towards projects like those, and less to projects like “stare at the nature of cognition until I see a way through the constraints”, which feels like the sort of thing that I would need to do to actually shift my sense of doom.
Adam, can you make a positive case here for how the work being done on prosaic alignment leads to success? You didn’t make one, and without it I don’t understand where you’re coming from. I’m not asking you to tell me a story that you have 100% probability on, just what is the success story you’re acting under, such that EY’s stances seem to you to be mostly distracting people from the real work.
(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))
Thanks for trying to understand my point and asking me for more details. I appreciate it.
Yet I feel weird when trying to answer, because my gut reaction to your comment is that you’re asking the wrong question? Also, the compression of my view to “EY’s stances seem to you to be mostly distracting people from the real work” sounds more lossy than I’m comfortable with. So let me try to clarify and focus on these feelings and impressions, then I’ll answer more about which success stories or directions excite me.
My current problem with EY’s stances is twofold:
First, in posts like this one, he literally writes that everything done under the label of alignment is faking it and not even attacking the problem, except like 3 people who, even if they’re trying, have it all wrong. I think this is completely wrong, and that’s even more annoying because I find that most people working on alignment are trying far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.
This is a problem because it doesn’t help anyone working on the field to maybe solve the problems with their approaches that EY sees, which sounds like a massive missed opportunity.
This is also a problem because EY’s opinions are still quite promoted in the community (especially here on the AF and LW), such that newcomers going for what the founder of the field has to say go away with the impression that no one is doing valuable work.
Far more speculative (because I don’t know EY personally), but I expect that kind of judgment to come not so much from a place of all-encompassing genius but instead from generalization after reading some posts/papers. And I’ve received messages following this thread from people who were just as annoyed as I was, and who felt their results had been dismissed without even a comment, or classified as trivial when everyone else, including the authors, was quite surprised by them. I’m ready to give EY a bit of “he just sees further than most people”, but not enough that he can discard the whole field from reading a couple of AF posts.
Second, historically, a lot of MIRI’s work has followed a specific epistemic strategy of trying to understand the optimal ways of deciding and thinking, both to predict how an AGI would actually behave and to try to align it. I’m not that convinced by this approach, but even giving it the benefit of the doubt, it has in no way led to accomplishments big enough to justify EY’s (and MIRI’s?) thinly veiled contempt for anyone not doing that. This had and still has many bad impacts on the field and on new entrants.
A specific subgroup of people tend to be nerd-sniped by this older MIRI’s work, because it’s the only part of the field that is more formal, but IMO at the loss of most of what matters about alignment and most of the grounding.
People who don’t have the technical skill to work on MIRI’s older work feel like they have to skill up drastically in maths to be able to do anything relevant in alignment. I literally mentored three people like that, who could actually do a lot of good thinking and cared about alignment, and I had to drill into their heads that they didn’t need super advanced maths skills, except if they wanted to do very, very specific things.
I find that particularly sad because IMO the biggest positive contribution to the field by EY and early MIRI comes from their less formal and more philosophical work, which is exactly the kind of work that is stifled by the consequences of this stance.
I also feel people here underestimate how repelling this whole attitude has been for years for most people outside the MIRI bubble. From testimonials by a bunch of more ML-oriented people, and from how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt for experimental work and for anything that isn’t decision theory and logic, I expect that this has been one of the big factors in alignment not being taken seriously and people not wanting to work on it.
It’s also important to note that I don’t know if EY and MIRI still think this kind of technical research is highly valuable, the real research, and what should be done, but they have been influential enough that I think a big part of the damage is done, and I read some parts of this post as “If only we could do the real logic thing, but we can’t, so we’re doomed”. There’s also a question of the separation between the image that MIRI and EY project and what they actually think.
Going back to your question, it has a weird double-standard feel. Like, every AF post on more prosaic alignment methods comes with its success story and reason for caring about the research. If EY and MIRI want to argue that we’re all doomed, they have the burden of proof to explain why everything that’s been done is terrible and will never lead to alignment. Once again, proving that we won’t be able to solve a problem is incredibly hard and improbable. Funny how everyone here gets that for the “AGI is impossible” question, but apparently that doesn’t apply to “Actually working with AIs and thinking about real AIs will never let you solve alignment in time.”
Still, it’s not too difficult to list a bunch of promising stuff, so here’s a non-exhaustive list:
John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn similar abstractions that humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.
People from EleutherAI working on understanding LMs and GPT-like models as simulators of processes (called simulacra), as well as the safety benefits (corrigibility) and new strategies (leveraging the output distribution in smart ways) that this model allows.
Evan Hubinger’s work on finding predicates that we could check during training to avoid deception and behaviors we’re worried about. He has a full research agenda but it’s not public yet. Maybe our post on myopic decision theory could be relevant.
Stuart Armstrong’s work on model splintering, especially his AI Safety Subprojects, which are experimental, whose outcomes are not obvious in advance, and which are directly relevant to implementing and using model splintering to solve alignment.
Paul Christiano’s recent work on making question-answerers give useful information instead of what they expect humans to answer, which has a clear success story for these kinds of powerful models and their use in building stronger AIs and supervising training for example.
It’s also important to remember how alignment and the related problems and ideas are still not that well explained, distilled and analyzed for teaching and criticism. So I’m excited too about work that isn’t directly solving alignment but just making things clearer and more explicit, like Evan’s recent post or my epistemic strategies analysis.
Thanks for naming specific work you think is really good! I think it’s pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn’t cruxy (because there’s a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it’s super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.
Eliezer liked this post enough that he asked me to signal-boost it in the MIRI Newsletter back in April.
And Paul Christiano and Stuart Armstrong are two of the people Eliezer named as doing very-unusually good work. We continue to pay Stuart to support his research, though he’s mainly supported by FHI.
And Evan works at MIRI, which provides some Bayesian evidence about how much we tend to like his stuff. :)
So maybe there’s not much disagreement here about what’s relatively good? (Or maybe you’re deliberately picking examples you think should be ‘easy sells’ to Steel Eliezer.)
The main disagreement, of course, is about how absolutely promising this kind of stuff is, not how relatively promising it is. This could be some of the best stuff out there, but my understanding of the Adam/Eliezer disagreement is that it’s about ‘how much does this move the dial on actually saving the world?’ / ‘how much would we move the dial if we just kept doing more stuff like this?’.
Actually, this feels to me like a thing that your comments have bounced off of a bit. From my perspective, Eliezer’s statement was mostly saying ‘the field as a whole is failing at our mission of preventing human extinction; I can name a few tiny tidbits of relatively cool things (not just MIRI stuff, but Olah and Christiano), but the important thing is that in absolute terms the whole thing is not getting us to the world where we actually align the first AGI systems’.
My Eliezer-model thinks nothing (including MIRI stuff) has moved the dial much, relative to the size of the challenge. But your comments have mostly been about a sort of status competition between decision theory stuff and ML stuff, between prosaic stuff and ‘gain new insights into intelligence’ stuff, between MIRI stuff and non-MIRI stuff, etc. This feels to me like it’s ignoring the big central point (‘our work so far is wildly insufficient’) in order to haggle over the exact ordering of the wildly-insufficient things.
You’re zeroed in on the “vast desert” part, but the central point wasn’t about the desert-oasis contrast, it was that the whole thing is (on Eliezer’s model) inadequate to the task at hand. Likewise, you’re talking a lot about the “fake” part (and misstating Eliezer’s view as “everyone else [is] a faker”), when the actual claim was about “work that seems to me to be mostly fake or pointless or predictable” (emphasis added).
Maybe to you these feel similar, because they’re all just different put-downs. But… if those were true descriptions of things about the field, they would imply very different things.
I would like to put forward that Eliezer thinks, in good faith, that this is the best hypothesis that fits the data. I absolutely think reasonable people can disagree with Eliezer on this, and I don’t think we need to posit any bad faith or personality failings to explain why people would disagree.
Also, I feel like I want to emphasize that, like… it’s OK to believe that the field you’re working in is in a bad state? The social pressure against saying that kind of thing (or even thinking it to yourself) is part of why a lot of scientific fields are unhealthy, IMO. I’m in favor of you not taking for granted that Eliezer’s right, and pushing back insofar as your models disagree with his. But I want to advocate against:
Saying false things about what the other person is saying. A lot of what you’ve said about Eliezer and MIRI is just obviously false (e.g., we have contempt for “experimental work” and think you can’t make progress by “Actually working with AIs and Thinking about real AIs”).
Shrinking the window of ‘socially acceptable things to say about the field as a whole’ (as opposed to unsolicited harsh put-downs of a particular researcher’s work, where I see more value in being cautious).
I want to advocate ‘smack-talking the field is fine, if that’s your honest view; and pushing back is fine, if you disagree with the view’. I want to see more pushing back on the object level (insofar as people disagree), and less ‘how dare you say that, do you think you’re the king of alignment or something’ or ‘saying that will have bad social consequences’.
I think you’re picking up on a real thing of ‘a lot of people are too deferential to various community leaders, when they should be doing more model-building, asking questions, pushing back where they disagree, etc.’ But I think the solution is to shift more of the conversation to object-level argument (that is, modeling the desired behavior), and make that argument as high-quality as possible.
Thanks for your great comments!
One thing I want to make clear is that I’m quite aware that my comments have not been as high-quality as they should have been. As I wrote in the disclaimer, I was writing from a place of frustration and annoyance, which also implies a focus on more status-y things. That seemed necessary to me to air out this frustration, and I think it was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.
That being said, much of what I was railing against is a general perception of the situation, from reading a lot of stuff but not necessarily stopping to study all the evidence before writing a fully thought-through opinion. I think this is where the “saying obviously false things” comes from (which I think are pretty easy to believe from just reading this post and a bunch of MIRI write-ups), and why your comments are really important for clarifying the discrepancy between this general mental picture I was drawing from and the actual reality. Also, recentering the discussion on the object level instead of on status arguments sounds like a good move.
You make a lot of good points and I definitely want to continue the conversation and have more detailed discussion, but I also feel that for the moment I need to take some steps back, read your comments and some of the pointers in other comments, and think a bit more about the question. I don’t think there’s much more to gain from me answering quickly, mostly in reaction.
(I also had the brilliant idea of starting this thread just when I was on the edge of burning out from working too much (during my holidays), so I’m just going to take some time off from work. But I definitely want to continue this conversation further when I come back, although probably not in this thread ^^)
Enjoy your rest! :)
If you’d just aired out your frustration, framing claims about others in NVC-like ‘I feel like...’ terms (insofar as you suspect you wouldn’t reflectively endorse them), and then a bunch of people messaged you in private to say “thank you! you captured my feelings really well”, then that would seem clearly great to me.
I’m a bit worried that what instead happened is that you made a bunch of clearly false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; that you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and that you then ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So I could imagine that some people who aren’t independently familiar with all the background facts could come away with a lot of wrong beliefs about the people you’re criticizing.
‘Other people liked my comment, so it was clearly a good thing’ doesn’t distinguish between the worlds where they like it because they share the feelings, vs. agreeing with the factual claims and arguments (and if the latter, whether they’re noticing and filtering out all the seriously false or not-locally-valid parts). If the former, I think it was good. If the latter, I think it was bad.
(By default I’d assume it’s some mix.)
That sounds a bit unfair, in the sense that it makes it look like I just invented stuff I didn’t believe and ran with it, when what actually happened was that I wrote about my frustrations but made the mistake of stating them as obvious facts instead of impressions.
Of course, I imagine you feel that my portrayal of EY and MIRI was also unfair, sorry about that.
(I added a note to the three most ranty comments on this thread saying that people should mentally add “I feel like...” to judgments in them.)
Thanks for adding the note! :)
I’m confused. When I say ‘that’s just my impression’, I mean something like ‘that’s an inside-view belief that I endorse but haven’t carefully vetted’. (See, e.g., Impression Track Records, referring to Naming Beliefs.)
Example: you said that MIRI has “contempt with experimental work and not doing only decision theory and logic”.
My prior guess would have been that you don’t actually, for-real believe that—that it’s not your ‘impression’ in the above sense, more like ‘unendorsed venting/hyperbole that has a more complicated relation to something you really believe’.
If you do (or did) think that’s actually true, then our models of MIRI are much more different than I thought! Alternatively, if you agree this is not true, then that’s all I meant in the previous comment. (Sorry if I was unclear about that.)
I would say that with slight caveats (make “decision theory and logic” a bit larger to include some more mathy stuff, and make “all experimental work” a bit smaller to not include Redwood’s work), this was indeed my model.
What made me update from our discussion is the realization that I had interpreted the dismissal of basically all alignment research as “this has no value whatsoever and people doing it are just pretending to care about alignment”, whereas it should have been interpreted as something like “this is potentially interesting/new/exciting, but it doesn’t look like it brings us closer to solving alignment in a significant way, hence we’re still failing”.
‘Experimental work is categorically bad, but Redwood’s work doesn’t count’ does not sound like a “slight caveat” to me! What does this generalization mean at all if Redwood’s stuff doesn’t count?
(Neither, for that matter, does the difference between ‘decision theory and logic’ and ‘all mathy stuff MIRI has ever focused on’ seem like a ‘slight caveat’ to me—but in that case maybe it’s because I have a lot more non-logic, non-decision-theory examples in my mind that you might not be familiar with, since it sounds like you haven’t read much MIRI stuff?).
(Responding to entire comment thread) Rob, I don’t think you’re modeling what MIRI looks like from the outside very well.
There’s a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)
There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.
There was a non-public agenda that involved Haskell programmers. That’s about all I know about it. For all I know they were doing something similar to the modal logic work I’ve seen in the past.
Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with the possible exception of Chris Olah. Chris’s work is not central to the cluster I would call “experimentalist”.
There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn’t the main point of that paper and was an extremely predictable result.
There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/or doing.
There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of “well, let’s at least die in a slightly more dignified way”.
I don’t particularly agree with Adam’s comments, but it does not surprise me that someone could come to honestly believe the claims within them.
So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I was thinking’) is not great for discussion.
I think it’s true that ‘MIRI is super not into most ML alignment work’, and I think it used to be true that MIRI put almost all of its research effort into HRAD-ish work, and regardless, this all seems like a completely understandable cached impression to have of current-MIRI. If I wrote stuff that makes it sound like I don’t think those views are common, reasonable, etc., then I apologize for that and disavow the thing I said.
But this is orthogonal to what I thought I was talking about, so I’m confused about what seems to me like a topic switch. Maybe the implied background view here is:
‘Adam’s elision between those two claims was a totally normal level of hyperbole/imprecision, like you might find in any LW comment. Picking on word choices like “only decision theory and logic” versus “only research that’s clustered near decision theory and logic in conceptspace”, or “contempt with experimental work” versus “assigning low EV to typical instances of empirical ML alignment work”, is an isolated demand for rigor that wouldn’t make sense as a general policy and isn’t, in any case, the LW norm.’
Is that right?
It occurs to me that part of the problem may be precisely that Adam et al. don’t think there’s a large difference between these two claims (that actually matters). For example, when I query my (rough, coarse-grained) model of [your typical prosaic alignment optimist], the model in question responds to your statement with something along these lines:
Moreover: I don’t think [my model of] the prosaic alignment optimist is being stupid here. I think, to the extent that his words miss an important distinction, it is because that distinction is missing from his very thoughts and framing, not because he happened to choose his words somewhat carelessly when attempting to describe the situation. Insofar as this is true, I expect him to react to your highlighting of this distinction with (mostly) bemusement, confusion, and possibly even some slight suspicion (e.g. that you’re trying to muddy the waters with irrelevant nitpicking).
To be clear: I don’t think you’re attempting to muddy the waters with irrelevant nitpicking here. I think you think the distinction in question is important because it’s pointing to something real, true, and pertinent—but I also think you’re underestimating how non-obvious this is to people who (A) don’t already deeply understand MIRI’s view, and (B) aren’t in the habit of searching for ways someone’s seemingly pointless statement might actually be right.
I don’t consider myself someone who deeply understands MIRI’s view. But I do want to think of myself as someone who, when confronted with a puzzling statement [from someone whose intellectual prowess I generally respect], searches for ways their statement might be right. So, here is my attempt at describing the real crux behind this disagreement:
(with the caveat that, as always, this is my view, not Rob’s, MIRI’s, or anybody else’s)
(and with the additional caveat that, even if my read of the situation turns out to be correct, I think in general the onus is on MIRI to make sure they are understood correctly, rather than on outsiders to try to interpret them—at least, assuming that MIRI wants to make sure they’re understood correctly, which may not always be the best use of researcher time)
I think the disagreement is mostly about MIRI’s counterfactual behavior, not about their actual behavior. I think most observers (including both Adam and Rob) would agree that MIRI leadership has been largely unenthusiastic about a large class of research that currently falls under the umbrella “experimental work”, and that the amount of work in this class MIRI has been unenthused about significantly outweighs the amount of work they have been excited about.
Where I think Adam and Rob diverge is in their respective models of the generator of this observed behavior. I think Adam (and those who agree with him) thinks that the true boundary of the category [stuff MIRI finds unpromising] roughly coincides with the boundary of the category [stuff most researchers would call “experimental work”], such that anything that comes too close to “running ML experiments and seeing what happens” will be met with an immediate dismissal from MIRI. In other words, [my model of] Adam thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would be roughly the same across many possible counterfactual worlds, even if each of those worlds is doing “experiments” investigating substantially different hypotheses.
Conversely, I think Rob thinks the true boundary of the category [stuff MIRI finds unpromising] is mostly unrelated to the boundary of the category [stuff most researchers would call “experimental work”], and that—to the extent MIRI finds most existing “experimental work” unpromising—this is mostly because the existing work is not oriented along directions MIRI finds promising. In other words, [my model of] Rob thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would vary significantly across counterfactual worlds where researchers investigate different hypotheses; in particular, [my model of] Rob thinks MIRI would find most “experimental work” highly promising in the world where the “experiments” being run are those whose results Eliezer/Nate/etc. would consider difficult to predict in advance, and therefore convey useful information regarding the shape of the alignment problem.
I think Rob’s insistence on maintaining the distinction between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” is in fact an attempt to gesture at the underlying distinction outlined above, and I think that his stringency on this matter makes significantly more sense in light of this. (Though, once again, I note that I could be completely mistaken in everything I just wrote.)
Assuming, however, that I’m (mostly) not mistaken, I think there’s an obvious way forward in terms of resolving the disagreement: try to convey the underlying generators of MIRI’s worldview. In other words, do the thing you were going to do anyway, and save the discussions about word choice for afterwards.
^ This response is great.
I also think I naturally interpreted the terms in Adam’s comment as pointing to specific clusters of work in today’s world, rather than universal claims about all work that could ever be done. That is, when I see “experimental work and not doing only decision theory and logic”, I automatically think of “experimental work” as pointing to a specific cluster of work that exists in today’s world (which we might call mainstream ML alignment), rather than “any information you can get by running code”. Whereas it seems you interpreted it as something closer to “MIRI thinks there isn’t any information to get by running code”.
My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn’t mainstream ML is a good candidate for how you would start to interpret “experimental work” as the latter.) But this seems very plausibly a typical mind fallacy.
EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying “no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers”, and now I realize that is not what you were saying.
This is a good comment! I also agree that it’s mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don’t have a ready-on-hand link that clearly articulates the thing I’m trying to say, then it’s not surprising if others don’t have it in their model.
And based on these comments, I update that there’s probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/etc. If so, sorry about jumping to conclusions, Adam!
Not sure if this helps, and haven’t read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I’d have are that:
* For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
* For almost all experimental results you would think they were so much less informative as to not be worthwhile.
* There’s a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.
(I’d be willing to take bets on these or pick candidate experiments to clarify this.)
In addition, a consequence of these beliefs is that you think we should be spending way more time sitting around thinking about stuff, and way less time doing experiments, than I do.
I would agree with you that “MIRI hates all experimental work” / etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.
Ooh, that’s really interesting. Thinking about it, I think my sense of what’s going on is (and I’d be interested to hear how this differs from your sense):
Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like ‘sufficiently capable and general AI is likely to have property X as a strong default, because approximately-X-ish properties don’t seem particularly difficult to implement (e.g., they aren’t computationally intractable), and we can see from first principles that agents will be systematically less able to get what they want when they lack property X’. My sense is that MIRI puts more weight on arguments like this for reasons like:
We’re more impressed with the track record of inside-view reasoning in science.
I suspect this is partly because the average alignment researcher is impressed with how unusually-poorly inside-view reasoning has done in AI—many have tried to gain a deep understanding of intelligence, and many have failed—whereas (for various reasons) MIRI is less impressed with this, and defaults more to the base rate for other fields, where inside-view reasoning has more extraordinary feats under its belt.
We’re more wary of “modest epistemology”, which we think often acts like a self-fulfilling prophecy. (You don’t practice trying to mechanistically model everything yourself, you despair of overcoming biases, you avoid thinking thoughts that would imply you’re a leader or pioneer because that feels arrogant, so you don’t gain as much skill or feedback in those areas.)
Compared to the average alignment researcher, MIRI tends to put less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed’. This is for a variety of reasons, including:
MIRI is more generally wary of putting much weight on surface generalizations, if we don’t have an inside-view reason to expect the generalization to keep holding.
MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
Relatedly, MIRI thinks AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked. Progress on understanding AGI is much harder to predict than progress on hardware, so we can’t derive as much from trends.
Applying this to experiments:
I’d have the same prediction, though I’m less confident that ‘pessimism about experiments’ is doing much work here, vs. ‘pessimism about alignment’. To distinguish the two, I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you (though probably the gap will be smaller?).
I do expect there to be some experiment-specific effect. I don’t know your views well, but if your views are sufficiently like my mental model of ‘typical alignment researcher whose intuitions differ a lot from MIRI’s’, then my guess would be that the disagreement comes down to the above two factors.
1 (more trust in inside view): For many experiments, I’m imagining Eliezer saying ‘I predict the outcome will be X’, and then the outcome is X, and the Modal Alignment Researcher says: ‘OK, but now we’ve validated your intuition—you should be much more confident, and that update means the experiment was still useful.’
To which Hypothetical Eliezer says: ‘I was already more than confident enough. Empirical data is great—I couldn’t have gotten this confident without years of honing my models and intuitions through experience—but now that I’m there, I don’t need to feign modesty and pretend I’m uncertain about everything until I see it with my own eyes.’ (A small worked example of why such an update is tiny follows after point 2 below.)
2 (less trust in AGI sticking to trends): For many obvious ML experiments Eliezer can’t predict the outcome of, I expect Eliezer to say ‘This experiment isn’t relevant, because factors X, Y, and Z give us strong reason to think that the thing we learn won’t generalize to AGI.’
Which ties back in to 1 as well, because if you don’t think we can build very reliable models in AI without constant empirical feedback, you’ll rarely be confident of abstract reasons X/Y/Z to expect a difference between current ML and AGI, since you can’t go walk up to an AGI today and observe what it’s like.
(You also won’t be confident that X/Y/Z don’t hold—all the possibilities will seem plausible until AGI is actually here, because you generally don’t trust yourself to reason your way to conclusions with much confidence.)
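To put rough numbers on point 1, here’s a minimal sketch with made-up values (not anyone’s actual credences): if you already assign 0.95 to a prediction, a confirming result with a modest likelihood ratio moves you only slightly.

```python
# Minimal sketch (made-up numbers, not anyone's actual credences) of why a
# confirming experiment barely moves someone who is already confident.
prior = 0.95                 # assumed prior that the prediction is right
likelihood_ratio = 3.0       # assumed P(result | right) / P(result | wrong)

prior_odds = prior / (1 - prior)                 # 19 : 1
posterior_odds = prior_odds * likelihood_ratio   # 57 : 1
posterior = posterior_odds / (1 + posterior_odds)

print(f"prior = {prior:.3f}, posterior = {posterior:.3f}")
# prior = 0.950, posterior = 0.983 -- a real but small shift, which is the sense
# in which the data "validates" a prediction the predictor already trusted.
```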
Thanks. For time/brevity, I’ll just say which things I agree / disagree with:
> sufficiently capable and general AI is likely to have property X as a strong default [...]
I generally agree with this, although for certain important values of X (such as “fooling humans for instrumental reasons”) I’m probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I’m also probably more optimistic (but not certain) that those efforts will succeed.
[inside view, modest epistemology]: I don’t have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.
> less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed’
I agree, see my post On the Risks of Emergent Behavior in Foundation Models. In the past I think I put too much weight on this type of reasoning, and also think most people in ML put too much weight on it.
> MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
Probably disagree but hard to tell. I think there will both be a lot of similarities and a lot of differences.
> AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked
Seems pretty wrong to me. We probably need both insight and hardware, but the insights themselves are hardware-bottlenecked: once you can easily try lots of stuff and see what happens, insights are much easier, see Crick on x-ray crystallography for historical support (ctrl+f for Crick).
> I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you
I’m more pessimistic than MIRI about HRAD, though that has selection effects. I’ve found conceptual work to be pretty helpful for pointing to where problems might exist, but usually relatively confused about how to address them or how specifically they’re likely to manifest. (Which is to say, overall highly valuable, but consistent with my take above on inside views.)
[experiments are either predictable or uninformative]: Seems wrong to me. As a concrete example: Do larger models have better or worse OOD generalization? I’m not sure if you’d pick “predictable” or “uninformative”, but my take is:
* The outcome wasn’t predictable: within ML there are many people who would have taken each side. (I personally was on the wrong side, i.e. predicting “worse”.)
* It’s informative, for two reasons: (1) It shows that NNs “automatically” generalize more than I might have thought, and (2) Asymptotically, we expect the curve to eventually reverse, so when does that happen and how can we study it?
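To make the shape of the experiment concrete, here’s a minimal sketch with placeholder numbers rather than real measurements: track the in-distribution vs. out-of-distribution gap as a function of model size and ask whether it shrinks, grows, or eventually reverses.

```python
# Minimal sketch of the experiment's shape; the accuracies below are
# placeholders, not real measurements.
model_sizes  = [1e8, 1e9, 1e10]       # parameter counts (hypothetical)
in_dist_acc  = [0.82, 0.88, 0.92]     # placeholder in-distribution accuracy
ood_acc      = [0.61, 0.72, 0.80]     # placeholder out-of-distribution accuracy

for n, acc_id, acc_ood in zip(model_sizes, in_dist_acc, ood_acc):
    gap = acc_id - acc_ood
    print(f"{n:.0e} params: ID={acc_id:.2f}, OOD={acc_ood:.2f}, gap={gap:.2f}")
# "Better or worse OOD generalization?" then becomes the question of whether
# this gap shrinks with scale, and whether that trend eventually reverses.
```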
See also my take on Measuring and Forecasting Risks from AI, especially the section on far-off risks.
> Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
Would agree with “most”, but I think you probably meant something like “almost all”, which seems wrong. There’s lots of people working on interpretability, and some of the work seems quite good to me (aside from Chris, I think Noah Goodman, Julius Adebayo, and some others are doing pretty good work).
I’m not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count “the same”) had any clear-to-my-model relevance to alignment, or even AGI. AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we’ve got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem. Bigger models generalizing better or worse doesn’t say anything obvious to any piece of my model of the Big Problem. Though if larger models start generalizing more poorly, then it takes longer to stupidly-brute-scale to AGI, which I suppose affects timelines some, but that just takes timelines from unpredictable to unpredictable sooo.
If we qualify an experiment as interesting when it can tell anyone about anything, then there’s an infinite variety of experiments “interesting” in this sense and I could generate an unlimited number of them. But I do restrict my interest to experiments which can not only tell me something I don’t know, but tell me something relevant that I don’t know. There is also something to be said for opening your eyes and staring around, but even then, there’s an infinity of blank faraway canvases to stare at, and the trick is to go wandering with your eyes wide open someplace you might see something really interesting. Others will be puzzled and interested by different things and I don’t wish them ill on their spiritual journeys, but I don’t expect the vast majority of them to return bearing enlightenment that I’m at all interested in, though now and then Miles Brundage tweets something (from far outside of EA) that does teach me some little thing about cognition.
I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.
Do larger models generalize better or more poorly OOD? It’s a relatively basic question as such things go, and no doubt of interest to many, and may even update our timelines from ‘unpredictable’ to ‘unpredictable’, but… I’m trying to figure out how to say this, and I think I should probably accept that there’s no way to say it that will stop people from trying to sell other bits of research as Super Relevant To Alignment… it’s okay to have an understanding of reality which makes narrower guesses than that about which projects will turn out to be very relevant.
Trying to rephrase it in my own words (which will necessarily lose some details): are you interested in Redwood’s research because it might plausibly generate alignment issues and problems analogous to the real problem, within the safer regime and technology we have now? Which might tell us, for example, which aspects of these predictable problems crop up first, and why?
It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like “What actually went into the nonviolence predicate instead of just nonviolence?” Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.
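To illustrate the “what actually went into the predicate” worry, here is a toy sketch using a keyword-matching stand-in for a real learned classifier; the stand-in and the probe sentences are placeholders of my own, not Redwood’s actual model or data. The point is that out-of-distribution probes are what expose which proxy the predicate actually learned.

```python
# Toy sketch of probing a learned "nonviolence" predicate out-of-distribution.
# The keyword stand-in and the probe sentences are illustrative placeholders,
# not Redwood's actual model or data.

def toy_nonviolence_predicate(text: str) -> bool:
    """Stand-in for a learned classifier: treats 'no violent keywords' as 'nonviolent'."""
    violent_keywords = {"stab", "shoot", "punch"}
    return not any(word in text.lower() for word in violent_keywords)

# OOD probes chosen so that surface cues and actual (non)violence come apart.
probes = [
    ("The surgeon made an incision to save the patient.", True),
    ("He threatened to hurt them unless they paid.", False),
    ("She punched the numbers into the calculator.", True),
]

for text, truly_nonviolent in probes:
    predicted = toy_nonviolence_predicate(text)
    status = "OK  " if predicted == truly_nonviolent else "MISS"
    print(f"{status} predicted={predicted} actual={truly_nonviolent}  {text}")
# The MISS cases are exactly the ones that reveal the predicate learned
# "absence of violent keywords" rather than "nonviolence".
```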
I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/understandability/etc. in the first AGI systems. Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
E.g., if you think alignment research is mostly about testing outer reward functions to see what first-order behavior they produce in non-AGI systems, rather than about ‘looking in the learned model’s brain’ to spot mesa-optimization and analyze what that optimization is ultimately ‘trying to do’ (or whatever), then you probably won’t produce stuff that MIRI’s excited about regardless of how experimental vs. theoretical your work is.
(In which case, maybe this is not actually a crux for the usefulness of most alignment experiments, and is instead a crux for the usefulness of most alignment research in general.)
(I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like ‘What’s even the point of aligning AGI? What should humanity do with aligned AGI once it has it?’.)
One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)
I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has… disconcerting effects on my own level of optimism.
Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt at sharing my mental model (so you can flag in an answer, for readers of this thread, where you think I’m wrong).
Also, I just checked the publication list, and I’ve read or skimmed most things MIRI has published since 2014 (including most newsletters and blog posts on the MIRI website).
My model of MIRI is that initially, there was a bunch of people including EY who were working mostly on decision theory stuff, tiling, model theory, the sort of stuff I was pointing at. That predates Nate’s arrival, but in my model it becomes far more legible after that (so circa 2014/2015). In my model, I call that “old school MIRI”, and that was a big chunk of what I was pointing out in my original comment.
Then there are a bunch of things that seem to have happened:
Newer people (Abram and Scott come to mind, but mostly because they’re the ones who post on the AF and whom I’ve talked to) join this old-school MIRI approach and reshape it into Embedded Agency. Now this new agenda is a bit different from the old-school MIRI work, but I feel like it’s still not that far from decision theory and logic (with maybe a stronger emphasis on the Bayesian part for stuff like logical induction). That might be a part where we’re disagreeing.
A direction related to embedded agency and the decision theory and logic stuff, but focused on implementations through strongly typed programming languages like Haskell and type theory. That’s technically practical, but in my mental model this goes in the same category as “decision theory and logic stuff”, especially because that sort of programming is very close to logic and natural deduction.
MIRI starts its ML-focused agenda, which you already mentioned. The impression I still have is that this didn’t lead to much published work that was actually experimental, instead focusing on recasting questions of alignment through ML theory. But I’ve updated towards thinking MIRI has invested effort into looking at stuff from a more prosaic angle, based on looking more into what has been published there, because some of these ML papers had flown under my radar (there’s also the difficulty that when I read a paper by someone who now has a position elsewhere, say Ryan Carey or Stuart Armstrong, I think of their current affiliation rather than MIRI, even though the work was supported by MIRI (and apparently Stuart is still supported by MIRI)). This is the part of the model where I expect that we might have very different models because of your knowledge of what was being done internally and never released.
Some new people hired by MIRI fall into what I call the “Bell Labs MIRI” model, where MIRI just hires/funds people who have different approaches from its own but who they think are really bright (Evan and Vanessa come to mind, although I don’t know if that’s the thought process that went into hiring them).
Based on that model, and on feedback and impressions I’ve gathered from people about some MIRI researchers being very doubtful of experimental work, I arrived at my “all experimental work is useless” claim. I tried to carve out Redwood’s and Chris Olah’s work as exceptions, which makes for a weird model, but one that makes sense if you start from a strong prior of “MIRI thinks experimental work is useless”.
Our discussion made me think that there are probably far better generators for this general criticism of experimental work, and that they would actually make more sense than “experimental work is useless, except this and that”.
If you were in the situation described by The Rocket Alignment Problem, you could think “working with rockets right now isn’t useful, we need to focus on our conceptual confusions about more basic things” without feeling inherently contemptuous of experimentalism—it’s a tool in the toolbox (which may or may not be appropriate to the task at hand), not a low- or high-status activity on a status hierarchy.
Separately, I think MIRI has always been pretty eager to run experiments in software when they saw an opportunity to test important questions that way. It’s also been 4.5 years now since we announced that we were shifting a lot of resources away from Agent Foundations and into new stuff, and 3 years since we wrote a very long (though still oblique) post about that research, talking about its heavy focus on running software experiments. Though we also made sure to say:
I don’t think you can say MIRI has “contempt with experimental work” after four years of us mainly focusing on experimental work. There are other disagreements here, but this ties in to a long-standing objection I have to false dichotomies like:
‘we can either do prosaic alignment, or run no experiments’
‘we can either do prosaic alignment, or ignore deep learning’
‘we can either think it’s useful to improve our theoretical understanding of formal agents in toy settings, or think it’s useful to run experiments’
‘we can either think the formal agents work is useful, or think it’s useful to work with state-of-the-art ML systems’
I don’t think Eliezer’s criticism of the field is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’, because (a) he doesn’t think those external-pressures approaches are viable (absent a strong understanding of what’s going on inside the box), and (b) he sees the ‘open the black box’ type work as the critical blocker. (Hence his relative enthusiasm for Chris Olah’s work, which, you’ll notice, is about deep learning and not about decision theory.)
You’ve had a few comments along these lines in this thread, and I think this is where you’re most severely failing to see the situation from Yudkowsky’s point of view.
From Yudkowsky’s view, explaining and justifying MIRI’s work (and the processes he uses to reach such judgements more generally) was the main point of the sequences. He has written more on the topic than anyone else in the world, by a wide margin. He basically spent several years full-time just trying to get everyone up to speed, because the inductive gap was very very wide.
When I put on my Yudkowsky hat and look at both the OP and your comments through that lens… I imagine if I were Yudkowsky I’d feel pretty exasperated at this point. Like, he’s written a massive volume on the topic, and now ten years later a large chunk of people haven’t even bothered to read it. (In particular, I know (because it’s come up in conversation) that at least a few of the people who talk about prosaic alignment a lot haven’t read the sequences, and I suspect that a disproportionate number haven’t. I don’t mean to point fingers or cast blame here, the sequences are a lot of material and most of it is not legibly relevant before reading it all, but if you haven’t read the sequences and you’re wondering why MIRI doesn’t have a write-up on why they’re not excited about prosaic alignment… well, that’s kinda the write-up. Also I feel like I need a disclaimer here that many people excited about prosaic alignment have read the sequences, I definitely don’t mean to imply that this is everyone in the category.)
(To be clear, I don’t think the sequences explain all of the pieces behind Yudkowsky’s views of prosaic alignment, in depth. They were written for a different use-case. But I do think they explain a lot.)
Related: IMO the best roughly-up-to-date piece explaining the Yudkowsky/MIRI viewpoint is The Rocket Alignment Problem.
Thanks for the pushback!
My memory of the sequences is that they’re far more about defending and explaining the alignment problem than about criticizing prosaic AGI (maybe because the term couldn’t have been used years before Paul coined it?). Could you give me the best pointers to criticism of prosaic alignment in the sequences? (I’ve read the sequences, but I don’t remember every single post, and my impression from memory is what I’ve written above.)
I also feel that there might be a discrepancy between who I think of when I think of prosaic alignment researchers and what the category means in general/to most people here? My category mostly includes AF posters, people from a bunch of places like EleutherAI/OpenAI/DeepMind/Anthropic/Redwood, and people from CHAI and FHI. I expect most of these people to actually have read the sequences and tried to understand MIRI’s perspective. Maybe someone could point out a list of other places where prosaic alignment research is being done that I’m missing, especially places where people probably haven’t read the sequences? Or maybe I’m overestimating how many of the people in the places I mentioned have read the sequences?
I don’t mean to say that there’s critique of prosaic alignment specifically in the sequences. Rather, a lot of the generators of the Yudkowsky-esque worldview are in there. (That is how the sequences work: it’s not about arguing specific ideas around alignment, it’s about explaining enough of the background frames and generators that the argument becomes unnecessary. “Raise the sanity waterline” and all that.)
For instance, just the other day I ran across this:
(The earlier part of the post had a couple embarrassing stories of mistakes Yudkowsky made earlier, which is where the lesson came from.) Reading that, I was like, “man that sure does sound like the Yudkowsky-esque viewpoint on prosaic alignment”.
I think you are overestimating. At the orgs you list, I’d guess at least 25% and probably more than half have not read the sequences. (Low confidence/wide error bars, though.)
Thank you for the links Adam. To clarify, the kind of argument I’m really looking for is something like the following three (hypothetical) examples.
Mesa-optimization is the primary threat model for unaligned AGI systems. Over the next few decades there will be a lot of companies building ML systems that create mesa-optimizers. I think that within 5 years of current progress we will understand how ML systems create mesa-optimizers and how to stop it. Therefore I think the current field is adequate for the problem (80%).
When I look at the research we’re outputting, it seems to me that we are producing research faster and more flexibly than any comparably sized academic department globally, or than the ML industry, and so I am much more hopeful that we’re able to solve our difficult problem before the industry builds an unaligned AGI. I give it a 25% probability, which I suspect is much higher than Eliezer’s.
I basically agree that the alignment problem is hard and unlikely to be solved, but I don’t think we have any alternative to the current sorts of work being done, which is a combo of (a) agent foundations work, (b) designing theoretical training algorithms (like Paul is), or (c) directly aligning narrowly superhuman models. I am pretty open to Eliezer’s claim that we will fail, but I see no alternative plan to pour resources into.
Whatever you actually think about the field and how it will save the world, say it!
It seems to me that almost all of the arguments you’ve made work whether the field is a failure or not. The debate here has to pass through whether the field is on track or not, and we must not sidestep that conversation.
I want to leave this paragraph as social acknowledgment that you mentioned upthread that you’re tired and taking a break, and I want to give you a bunch of social space to not return to this thread for however long you need to take! Slow comments are often the best.
Thanks for the examples, that helps a lot.
I’m glad that I posted my inflammatory comment, if only because exchanging with you and Rob made me actually consider the question of “what is our story to success”, instead of just “are we making progress/creating valuable knowledge”. And the way you two have been casting it is way less aversive to me that the way EY tends to frame it. This is definitely something I want to think more about. :)
Appreciated. ;)
Glad to hear. And yeah, that’s the crux of the issue for me.
! Yay! That’s really great to hear. :)
I’m sympathetic to most of your points.
I have sympathy for the “this feels somewhat contemptuous” reading, but I want to push back a bit on the “EY contemptuously calling nearly everyone fakers” angle, because I think “[thinly] veiled contempt” is an uncharitable reading. He could be simply exasperated about the state of affairs, or wishing people would change their research directions but respect them as altruists for Trying At All, or who knows what? I’d rather not overwrite his intentions with our reactions (although it is mostly the job of the writer to ensure their writing communicates the right information [although part of the point of the website discussion was to speak frankly and bluntly]).
If superintelligence is approximately multimodal GPT-17 plus reinforcement learning, then understanding how GPT-3-scale algorithms function is exceptionally important to understanding super-intelligence.
Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.
Why do you think this? On the original definition of prosaic alignment, I don’t see why this would be true.
(In case it clarifies anything: my understanding of Paul’s research program is that it’s all about trying to achieve prosaic alignment for superintelligence. ‘Prosaic’ was never meant to imply ‘dumb’, because Paul thinks current techniques will eventually scale to very high capability levels.)
My thinking is that prosaic alignment can also apply to non-super intelligent systems. If multimodal GPT-17 + RL = superintelligence, then whatever techniques are involved with aligning that system would probably apply to multimodal GPT-3 + RL, despite not being superintelligence. Superintelligence is not a prerequisite for being alignable.
This is already reflected in the upvotes, but just to say it explicitly: I think the replies to this comment from Rob and dxu in particular have been exceptionally charitable and productive; kudos to them. This seems like a very good case study in responding to a provocative framing with a concentration of positive discussion norms that leads to productive engagement.
Just in case anyone hasn’t already seen these, EY wrote Challenges to Christiano’s capability amplification proposal and this comment (that I already linked to in a different comment on this page) (also has a reply thread), both in 2018. Also The Rocket Alignment Problem.
Context for anyone who’s not aware:
Here’s the xkcd comic which coined the term.
Thanks, I sometimes forget not everyone knows the term. :)
Strong upvote.
My original exposure to LW drove me away in large part because of the issues you describe. I would also add that (at least circa 2010) you needed to have a near-deistic belief in the anti-messianic emergence of some AGI so powerful that it can barely be described in terms of human notions of “intelligence.”