Hashing out long-standing disagreements seems low-value to me
(Status: a short write-up of some things that I find myself regularly saying in person. In this case, I’m writing up my response to the question of why I don’t spend a bunch more time trying to resolve disagreements with people in the community who disagree with me about the hopefulness of whatever research direction. I’m not particularly happy with it, but it’s been languishing in my draft folder for many months now and published is better than perfect.)
When I first joined the AI alignment community almost ten years ago, there were lots of disagreements—between groups like MIRI and Open Phil, between folks like Eliezer Yudkowsky and Paul Christiano, etc. At that time, I was optimistic about resolving a bunch of those disagreements. I invested quite a few hours in this project, over the years.
I didn’t keep track exactly, but extremely roughly, I think the people with very non-MIRI-ish perspectives I spent the most time trying to converge with (including via conversation, reading and writing blog posts, etc.) were:
Paul Christiano (previously at OpenAI, now at ARC): 100 hours? (Maybe as low as 50 or as high as 200.)
Daniel Dewey (then at Open Phil): 40 hours? (Possibly 100+.)
Nick Beckstead (then at Open Phil): 30 hours?
Holden Karnofsky (Open Phil): 20 hours?
Tom Davidson (Open Phil): 15 hours?
Another non-MIRI person I’ve spent at least a few hours trying to sync with about AI is Rohin Shah at DeepMind.
(Note that these are all low-confidence ass numbers. I have trouble estimating time expenditures when they’re spread across days in chunks that are spread across years, and when those chunks blur together in hindsight. Corrections are welcome.)
I continue to have some conversations like this, but my current model is that attempting to resolve older and more entrenched disagreements is not worth the time-cost.
It’s not that progress is impossible. It’s that we have a decent amount of evidence of what sorts of time-investment yield what amounts of progress, and it just isn’t worth the time.
On my view, Paul is one of the field’s most impressive researchers. Also, he has spent lots of time talking and working with MIRI researchers, and trying to understand our views.
If even Paul and I can’t converge that much over hundreds of hours, then I feel pretty pessimistic about the effects of a marginal hour spent trying to converge with other field leaders who have far less context on what MIRI-ish researchers think and why we think it. People do regularly tell me that I’ve convinced them of some central AI claim or other, but it’s rarely someone whose views are as distant from mine as Paul’s are, and I don’t recall any instance of it happening on purpose (as opposed to somebody cool who I didn’t have in mind randomly approaching me later to say “I found your blog post compelling”).
And I imagine the situation is pretty symmetric, at this level of abstraction. Since I think I’m right and I think Paul’s wrong, and we’ve both thought hard about these questions, I assume Paul is making some sort of mistake somewhere. But such things can be hard to spot. From his perspective, he should probably view me as weirdly entrenched in my views, and therefore not that productive to talk with. I suspect that he should at least strongly consider this hypothesis, and proportionally downgrade his sense of how useful it is to spend an hour trying to talk some sense into me!
As long as your research direction isn’t burning the commons, I recommend just pursuing whatever line of research you think is fruitful, without trying to resolve disagreements with others in the field.
Note that I endorse writing up what you believe! Articulating your beliefs is an important tool for refining them, and stating your beliefs can also help rally a variety of readers to join your research, which can be enormously valuable. Additionally, I think there’s value in reminding people that your view exists and reminding them to notice when observations support the view versus undermining it.
I’m not saying “don’t worry about making arguments, just do the work”; arguments have plenty of use. What I’m saying is that the use of arguments is not in persuading others already entrenched in their views; the use of arguments lies elsewhere.
(Hyperbolic summary: you can convince plenty of people of things, you just can’t convince particular people of particular things. An argument crafted for a particular person with a very different world-view won’t convince that person, but it might convince bystanders.)
Also, to be clear, I think there’s benefit in occasionally clashing with people with very different views. This can be useful for identifying places where your own arguments are weak and/or poorly articulated, and for checking whether others are seeing paths-to-victory that you’ve missed. (And they sometimes generate artifacts that are useful for rallying other researchers.)
I continue to do this occasionally myself. I just don’t think it’s worth doing with the goal of converging with the person I’m discussing with, if they’re a smart old-timer who’s consistently disagreed for a long time; that’s an unrealistic goal.
I still think that people with entrenched views can undergo various small shifts in their views by talking to other entrenched people from other schools of thought. If you’re trying to get more of those small shifts, it may be well worth having more conversations like that; and small shifts may add up to bigger updates over time.
The thing I think you don’t get is the naive “just close people in a room for however long it takes until they come out agreeing” that I sort of hoped was possible in 2014, and that I now think does not happen on the order of spending-weeks-together.
You might argue that the hours I invested in bridging some of the gaps between the wildly-different worldviews in this community were spent foolishly and inefficiently, and that someone more skillful may have more luck than me. That’s surely the case for many of the hours; I’m no master of explaining my views.
I’ll note, however, that I’ve tried a variety of different tools and methods, from raw arguing to facilitated discussion to slow written back-and-forth to live written debate. In my experience, none of it seems to appreciably increase the rate at which long-standing disagreements get resolved.
Also, naive as some of my attempts were, I don’t expect my next attempts to go significantly better. And I’ve seen a number of others try to bridge gaps within the field where I failed, and I have yet to see promising results from anyone else either.
Resolving entrenched disagreements between specific people just is not that easy; the differences in world-view are deep and fractal; smart established thinkers in the field already agree about most things that can be cheaply tested.
I’d love to be challenged on this count. If you think I’m being an idiot and ignoring an obvious way to resolve lots of major entrenched disagreements within the field, feel free to give it yet another shot. But until the counterevidence comes in, I plan to continue saying what I believe, but not much worrying about trying to bridge the existing gaps between me and specific individuals I’ve already talked to a lot, like Paul.
(For the record, this has been my stance since about 2017. For most of the intervening five years, I considered it also not really worth the time to write up a bunch of my beliefs, preferring to do direct alignment research. But my estimate of the benefits from my direct alignment research has fallen, so here we are!)
I think that the above is also a good explanation for why many ML engineers working on AI or AGI don’t see any particular reason to engage with or address arguments about high p(doom).
When from a distance one views a field that:
Has longstanding disagreements about basic matters
Has theories—but many of the theories have not resulted in really any concrete predictions that differentiate from standard expectations, despite efforts to do so.
Will continue to exist regardless of how well you criticize any one part of it.
There’s basically little reason to engage with it. These are all also evidence that there’s something epistemically off with what is going on in the field.
Maybe this evidence is wrong! But I do think that it is evidence, and not-weak evidence, and it’s very reasonable for a ML engineer to not deeply engage with arguments because of it.
I don’t think that’s it, because I don’t think the situation in AI Alignment is all that unusual. “Science progresses one funeral at a time” is a universally applied adage. There’s also an even more general “common wisdom” that people in a debate ~never succeed in changing each others’ minds, and the only point of having a debate is to sway the on-the-fence onlookers — that debates are a performance art.
There’s something fundamentally off in the human psyche that invalidates Aumann’s agreement theorem for us.
And put like this, I think the capabilities researchers will forever disagree with the alignment researchers for this exact same reason.
It might be the case that it’s because of a more universal thing. Like sometimes time is just necessary for science to progress. And definitely the right view of debate is of changing the POV of onlookers, not the interlocutors.
But—I still suspect, without being able to quantify, that alignment is worse than the other sciences in that the standards by-which-people-agree-what-good-work-is are just uncertain.
People in alignment sometimes say that alignment is pre-paradigmatic. I think that’s a good frame; I take it to mean that the standards of what qualifies as good work are themselves not yet ascertained, among many other things. I think that if paradigmaticity is a line with math on the left and, like… pre-atomic chemistry all the way on the right, alignment is pretty far to the right. Modern RL is further to the left, and modern supervised learning with transformers much further to the left, followed up by things for which we actually have textbooks which don’t go out of date every 12 months.
I don’t think this would be disputed? But this really means that it’s almost certain that > 80% of alignment-related-intellectual-output will be tossed at some point in the future, because that’s what pre-paradigmaticity means. (Like, 80% is arguably a best case scenario for pre-paradigmatic fields!) Which means in turn that engaging with it is really a deeply unattractive prospect.
I guess what I’m saying is that I agree that the situation for alignment is not at all bad for a pre-paradigmatic field, but if you call your field pre-paradigmatic that seems like a pretty bad place to be in, in terms of what kind of credibility well-calibrated observers should accord you.
Edit: And like, to the degree that arguments that p(doom) is high are entirely separate from the field of alignment, this is actually a reason for ML engineers to care deeply about alignment, as a way of preventing doom, even if it is preparadigmatic! But I’m quite uncertain that this is true.
Noting that I don’t dispute this.
An important reason why this is true is because existential risk prevention can’t be an experimental field. Some existential risks—such as asteroid impacts—can be understood with strong theory (like, settled physics). AI risk isn’t one of those (and any path by which it could become one of those depends on an inferential leap which is itself uncertain, namely, extrapolating results from near-term AI experiments, to much more powerful AI systems).
Yeah, I agree with that.
I do want to argue against the theory that science progresses one funeral at a time:
Fair enough, I suppose I don’t actually have a vast body of rigorous evidence in favour of that phrase.
Yeah, I’m starting to get a little queasy at the epistemics of AI Alignment, and I’m also getting concerned that our epistemic house isn’t in order.
Depending on what you mean by “any one part of it”, I think 3 is false. E.g., a sufficiently good critique of “AGI won’t just have human-friendly values by default” would cause MIRI to throw a party and close up shop.
Huh, roll to disbelieve on ‘sufficient to close up shop’? I don’t think this is my only crux for AI being really dangerous.
Even if sufficiently advanced AGI reliably converges to human-friendly values in a very strong sense (i.e. two rival humans trying to build AGIs for war, or many humans with many AGIs embarking on complex economic goals, will somehow always figure out the best things for humans even if it means disobeying orders by stupid humans)...
...there’s still a separate case to be made that multipolar narrow non-fully-superhuman AIs won’t kill us before the AGI sovereign fixes everything.
I think a more likely thing we’d want to stick around to do in that world is ‘try to accelerate humanity to AGI ASAP’. “Sufficiently advanced AGI converges to human-friendly values” is weaker than “AGI will just have human-friendly values by default”.
Well, that’s just not true.
My worry is selection effects distort it such that even if the true evidence of doom is less than LW thinks, the fact that we are always exposed to evidence for doom distorts the fact that there may be not that much evidence for doom.
For example, if there are 50 evidence units for doom and 500 evidence units against doom, but LW has found 25 evidence units for doom compared to 5 evidence units against doom, that’s a selection effect at work.
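To make the arithmetic in that example concrete, here is a minimal sketch. The "evidence units" are purely the hypothetical numbers from the example, not a real measurement, and the helper name `share_for` is made up for illustration.

```python
# Hypothetical numbers from the example above; "evidence units" are an
# informal illustration, not something anyone has actually measured.
true_for, true_against = 50, 500   # evidence that exists in the world
seen_for, seen_against = 25, 5     # evidence the community has surfaced


def share_for(n_for, n_against):
    """Fraction of an evidence pool that points toward doom."""
    return n_for / (n_for + n_against)


true_share = share_for(true_for, true_against)   # 50 / 550
seen_share = share_for(seen_for, seen_against)   # 25 / 30

# Sampling rate: the fraction of each kind of evidence that got surfaced.
rate_for = seen_for / true_for                   # 25 / 50
rate_against = seen_against / true_against       # 5 / 500

print(f"true share of evidence for doom:     {true_share:.3f}")
print(f"observed share of evidence for doom: {seen_share:.3f}")
print(f"pro-doom evidence surfaced {rate_for / rate_against:.0f}x more readily")
```

Under these made-up numbers, the observed pool looks heavily pro-doom even though the underlying pool leans the other way, purely because one kind of evidence is surfaced at a much higher rate. Whether anything like this holds for actual AI risk arguments is exactly the open question.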
My worry is something similar may be happening for AI risk.
And the value of knowing we’re wrong is far greater than the value of knowing we are epistemically correct.
Why do you think this?
Basically the fact LW has far more arguments for “alignment will be hard” compared to alignment being easy is the selection effect I’m talking about.
I was also worried because ML people don’t really think that AGI poses an existential risk, and that’s evidence, in an Aumann sense.
Now I do think this is explainable, but other issues remain:
That could either be ‘we’re selecting for good arguments, and the good arguments point toward alignment being hard’, or it could be a non-epistemic selection effect.
Why do you think it’s a non-epistemic selection effect? It’s easier to find arguments for ‘the Earth is round’ than ‘the Earth is flat’, but that doesn’t demonstrate a non-epistemic bias.
… By ‘an Aumann sense’ do you just mean ‘if you know nothing about a brain, then knowing it believes P is some Bayesian evidence for the truth of P’? That seems like a very weird way to use “Aumann”, but if that’s what you mean then sure. It’s trivial evidence to anyone who’s spent much time poking at the details, but it’s evidence.
Basically, it means that the fact that other smart people working in ML/AI don’t agree with LW is itself evidence that LW is wrong, since rational reasoners updating towards the same priors should see disagreements lessen until there isn’t disagreement, at least in the case where there is only 1 objective truth.
Now I do think this is explainable: capabilities researchers have vast incentives to adopt the position that AGI isn’t an existential risk, given that the potential power and massive impact of AGI benefit them.
I definitely agree that this alone doesn’t suggest that LW isn’t doing its epistemics well or that there is a problematic selection effect.
My worries re LW epistemics are the following:
There’s a lot more theory than empirical evidence. While this is changing for the better, theory being so predominant in LW culture is a serious problem, as theory can easily drift away from reality and model it poorly.
Comparing claims about AI to the idea that the world is round is sort of insane, as the fact that the world is round has both some theoretical evidence and massive empirical evidence, and the evidence LW has for AI catastrophe is nowhere close to that. It also isn’t necessary.
More specifically, the outside view suggests that AI takeoff is probably slower, and most importantly, it suggests that catastrophe from technology has a low prior, given that the massive number of claimed impending dooms/catastrophes never occur, and that catastrophe potential from nukes is almost certainly overrated, which tells us we should have much lower priors and suggests that humanity dying out is actually hard.
I’m not asking LWers to totally change their views, but to have more uncertainty in their estimates of AI risk.
I’m not much moved by these types of arguments, essentially because (in my view) the level of meta at which they occur is too far removed from the object level. If you look at the actual points your opponents lay out, and decide (for whatever reason) that you find those points uncompelling… that’s it. Your job here is done, and the remaining fact that they disagree with you is, if not explained away, then at least screened off. (And to be clear, sometimes it is explained away, although that happens mostly with bad arguments.)
Ditto for outside view arguments—if you’ve looked at past examples of tech, concluded that they’re dissimilar from AGI in a number of ways (not a hard conclusion to reach), and moreover concluded that some of those dissimilarities are strategically significant (a slightly harder conclusion, and one that some people stumble before reaching—but not, ultimately, that hard), then the base rates of the category being outside-viewed no longer contain any independently relevant information, which means that—again—your job here is done.
(I’ve made comments to similar effect in the past, and plan to continue trumpeting this horn for as long as the meme to which it is counter continues to exist.)
This does, of course, rely on your own reasoning to be correct, in the sense that if you’re wrong, well… you’re wrong. But of course, this really isn’t a particularly special kind of situation: it’s one that recurs all across life, in all kinds of fields and domains. And in particular, it’s not the kind of situation you should cower away from in fear—not if your goal is actually grasping the reality of the situation.
***
And finally (and obviously), all of this only applies to the person making the updates in the first place (which is why, you may notice, everything above the asterisks seems to inhabit the perspective of someone who believes they understand what’s happening, and takes for granted that it’s possible for them to be right as well as wrong). If you’re not in the position of such an individual, but instead conceive of yourself as primarily a third party, an outsider looking in...
...well, mostly I’d ask what the heck you’re doing, and why you aren’t either (1) trying to form your own models, to become one of the People Who Can Get Things Right As Well As Wrong, or—alternatively—(2) deciding that it’s not worth your time and effort, either because of a lack of comparative advantage, or just because you think the whole thing is Likely To Be Bunk.
It kind of sounds like you’re on the second path—which, to be clear, is totally fine! One of the predictable consequences of Daring to Disagree with Others is that Other Others might look upon you, notice that they can’t really tell who’s right from the outside, and downgrade their confidence accordingly. That’s fine, and even good in some sense: you definitely don’t want people thinking they ought to believe something even in [what looks to them like] the absence of any good arguments for it; that’s a recipe for irrationality.
But that’s the whole point, isn’t it: the perspectives of the Insider (the Researcher Trying to Get At the Truth) and the Outsider (the Bystander Peering Through the Windows) will not look identical, and for an obvious reason: they’re different people standing in different (epistemic) places! Neither one of them should agonize about the fact that the former has a tighter probability distribution than the latter; that’s what happens when you proceed further down the path. Ideally it’s the right path, but any path has the same property: your probability distribution narrows as you go further down, and your models become more specific and more detailed.
So go ahead and downgrade your assessment of “LW epistemics” accordingly, if that’s what you’ve decided is the right thing to do in your position as the outsider looking in. (Although I’d argue that what you’d really want is to downgrade your assessment of MIRI, instead of LW as a whole; they’re the most extreme ones in the room, after all. For the record, I think this is Pretty Awesome, but your mileage may vary.) But don’t demand that the Insider be forced to update their probability distribution to match yours—to widen their distribution, to walk back the path they’ve followed in the course of forming their detailed models—simply because you can’t see what [they think] they’re seeing, from their vantage point!
Those people are down in the trenches for a reason: they’re investigating what they see as the most likely possibilities, and letting them do their work is good, even if you think they haven’t justified their (seeming) confidence level to your satisfaction. They’re not trying to.
(Oh hey, I think that has something to do with the title of the post we’re commenting on.)
Thank you for answering, and I now get why narrower probability distributions are there for the inside view
I hope that no matter what the truth really is, LWers continue on, but always making sure that they are careful about how good their epistemics are.
I’d like to see a few more conversations that explicitly follow CFAR’s Double Crux procedure.
How many of the ~150 hours with Christiano used the Double Crux procedure or had other involvement with CFAR staff? I know that CFAR had some controversies, but that has no bearing on the fact that alignment researchers have a different skillset from CFAR, and even CFAR staff are only as good as the rationality techniques that have already been discovered.
Did you set up termination/dayslong breaks so that either party could privately have space to change their mind, with minimal loss of face/reputation? Did you set up plausible deniability so that choosing to take a break is not seen as an admission of defeat? Did you set up something like “I was wrong” insurance where the winner of the debate pays the loser, or commits to never absorbing/dismantling their faction, or setting up a long buffer period before notifying everyone?
Even when these discussions don’t produce agreement, do you think they’re helpful for the community?
I’ve spoken to several people who have found the MIRI dialogues useful as they enter the field, understand threat models, understand why people disagree, etc. It seems not-crazy to me that most of the value in these dialogues comes from their effects on the community (as opposed to their effects on the participants).
Three other thoughts/observations:
I’d be interested in seeing a list of strategies/techniques you’ve tried and offering suggestions. (I believe that you’ve tried a lot of things, though).
I was surprised at how low the hour estimates were, particularly for the OP people (especially Holden) and even for Paul. I suppose the opportunity cost of the people listed is high, but even so, the estimates make me think “wow, seems like not that much concentrated time has been spent trying to resolve some critical disagreements about the world’s most important topic” (as opposed to like “oh wow, so much time has been spent on this”).
How have discussions gone with some of the promising newer folks? (People with like 1-3 years of experience). Perhaps these discussions are less fruitful than marginal time with Paul (because newer folks haven’t thought about things for very long), but perhaps they’re more fruitful (because they have different ways of thinking, or maybe they just have different personalities/vibes).
Maybe worth keeping in mind that Nate isn’t the only MIRI person who’s spent lots of hours on this (e.g., Eliezer and Benya have as well), and the numbers only track Nate-time.
Also maybe worth keeping in mind the full list of things that need doing in the world. This is one of the key important leveraged things that needs doing, so it’s easy to say “spend more time on it”. But spending a thousand hours (so, like, a good chunk of a year where that’s the only thing Nate does?) trying to sync up with a specific individual, means that all the other things worth doing with that time can’t happen, like:
spending more focused time on the alignment problem, and trying to find new angles of attack that seem promising.
spending more time sharing models of the alignment problem so others can better attack the problem. (Models that may not be optimized for converging with Paul or Dario or others, but that are very useful for the many people with more Nate-ish technical intuitions.)
spending more time sharing models of the larger strategic landscape (that aren’t specifically optimized for converging with Paul etc.).
mindset interventions, like recruiting more people to the field with security mindset, or trying to find ways to instill security mindset in more people in the field.
work on making more operationally adequate institutions exist.
work on improving coordination between existing institutions.
work on building capacity such that if and when people come up with good alignment experiments to run, we’re able to quickly and effectively execute.
Some of those things seem easier to make fast progress on, or have more levers that haven’t been tried yet, and there are only so many hours to split between them. When you have tried something a lot and haven’t made much progress on it, at some point you should start making the update that it’s not so tractable, and change your strategy in response.
(At least, not so tractable right now. Another possible reason to put off full epistemic convergence is that it may be easier when the field is more mature and more of the arguments can pass through established formalisms, as opposed to needing to pass through fuzzy intuitions and heuristics. Though I don’t know to what extent that’s how Nate is thinking about the thing.)
IMO having and releasing those dialogues was one of the most obviously useful things MIRI has done to date, and I’m super happy with them. (Not speaking for Nate here.) I’m hoping we continue to do more stuff like that in the future.
Working at a startup made me realize how little we can actually “reason through” things to get to a point where all team members agree. Often there’s too little time to test all assumptions, if it’s even doable at all. Part of the role of the CEO is to “cut” these discussions when it’s evident that spending more time on them is worse than proceeding despite uncertainty. If we had “the facts”, we might find it easier to agree. But in an uncertain environment, many decisions come down to intuition (hopefully based on reliable experience, such as founding a similar previous startup).
To me it seems that there are parallels here. In discussions I can always push back on the intuitions of others, but where no reliable facts exist, I have little chance at getting far. Which is not always bad, since if we couldn’t function well under uncertainty, we would likely be a lot less successful as a species.
It seems worth utilizing appetites here. I.e., “I have an appetite of about 10 hours for this conversation”.
I’ve got an alternate proposal for what ‘increasing agreement over long-outstanding points of disagreement’ should look like. I think the emphasis should be on encouraging both parties to agree on and publish lists of what empirical evidence would help settle disagreements one way or another, ideally focused on agreeing about the design of potential experiments. The people having the disagreements can then offload the need to evaluate the cost of and run the experiments. Disagreements can be tabled until new evidence arrives, and then discussion can happen once new evidence is presented. People can independently search out coincidental revelations of evidence amongst research not targeted at the specific disagreements. I think in many cases new evidence is (and should be) the primary driver of scientific opinion change.
Mostly agree—my gears-level model is the conversations listed tend to hit Limits to Legibility constraints, and marginal returns drop to very low.
For people interested in something like “Double Crux” on what’s called here “entrenched views”, in my experience what has some chance of working is getting as much understanding as possible in one mind, and then attempting to match the ontologies and intuitions. (I had some success with this on “Drexlerian” vs “MIRIesque” views.)
Have you tried making Paul draw a graph of his expected capability level vs alignment level? How much of this was with Rob as mediator? And in general what was the result of these conversations, did you manage to determine some core disagreements?
I don’t think there is enough disagreement in the field to gather such evidence without your participation.
Do you think that the cause of the disagreements is mostly emotional or mostly factual?
Emotional being something like someone not wanting to be convinced of something that will raise their pdoom by a lot. This can be on a very subconscious level.
Factual being that they honestly just don’t agree, all emotions aside.
So yeah, I’m asking what you think is “mostly” the reason.
Have you tried, at the beginning of debates, establishing an agreement with your interlocutor to measure progress, explicitly or not, in units of “I hadn’t considered that before” or “I wasn’t considering that when I had my idea”? If you start noticing that there is a common theme to the things which had not been considered before, or in advance of developing the idea, it suggests general areas of improvement.
When people respond to arguments, a somewhat opaque impression is left as to whether they are coming up with counter-arguments on the spot or if those arguments were truly a part of the ideation process from the beginning.
If people keep failing considerations-in-mind litmus tests, you tell them to stop what they’re doing and start over. If they keep passing litmus tests you had in mind (or that you spontaneously notice you should have been keeping in mind), you should worry about your failure to adopt their idea.
Is it pointless in general to hash out long standing disagreements, or is there something especially difficult about AI safety?
Can you give an example of where you disagree with Paul?