Debate: Is it ethical to work at AI capabilities companies?

Epistemic status: Soldier mindset. These are not (necessarily) our actual positions, these are positions we were randomly assigned, and for which we searched for the strongest arguments we could find, over the course of ~1 hr 45 mins.

Sides: Ben was assigned to argue that it’s ethical to work for an AI capabilities company, and Lawrence was assigned to argue that it isn’t.

Reading Order: Ben and Lawrence drafted each round of statements simultaneously. This means that each of Lawrence's statements was written without Lawrence having read Ben's statement that immediately precedes it.

Ben’s Opening Statement

Ben Pace

I think it is sometimes ethical to work for AI capabilities companies.

I think the standard argument against my position is:

  1. It is bad to cause an existential failure for humanity.

  2. For these companies, we have a direct mechanism by which that may happen.

  3. Therefore, you should not work for them.

(I am granting that we both believe they are risking either extinction or takeover by an ~eternal alien dictatorship.)

I think this argument is strong, but I argue that there are some exceptions.

I think that there are still four classes of reason for joining.

First, you have a concrete mechanism by which your work there may prevent the existential threat. For instance, there were people in the Manhattan Project working on improving the safety of the atom bomb's engineering, and I think they had solid mechanistic reasons to think that they could substantially improve the safety. I’m not sure what probability to assign, because I think probabilities-without-mechanisms are far more suspect and liable to be meaningless, but if you have a concrete story for averting the threat, I think even a 10% chance is sufficient.

Second, I have often felt it is important to be able to join somewhat corrupt institutions, and mitigate the damage. I sometimes jokingly encourage my friends at IMO corrupt institutions to rise through the ranks and then dramatically quit when something unusually corrupt happens, and help start the fire for reform. I recall mentioning this to Daniel Kokotajlo, whose resignation was ultimately very influential (to be clear I don’t believe my comment had any impact on what happened). That said, I think most random well-intentioned people should not consider themselves strong enough to withstand the forces of corruption that will assault them on the inside of such an organization, and think many people are stupidly naive about this. (I think there’s a discussion to be had on what the strength of the forces are here and what kinds of forces and strengths you need to be able to withstand, and perhaps Lawrence can argue that it’s quite low or easy to overcome.)

Third, I think if you are in a position to use the reputation from being in a senior role to substantially improve the outsider’s understanding of the situation, then I could consider this worth it. I can imagine a world where I endorsed the creation of a rival AI company worth $10Bs if a person in a leadership position would regularly go out to high-profile platforms and yell “Our technology will kill us all! I am running at this to try to get a bit of control, but goddammit you have to stop us. Shut it all down or we are all doomed. Stop us!” I would genuinely consider it reasonable to work for such an existential-risk-producing organization, and if you’re in a position to be that person (e.g. a senior person who can say this), then I think I may well encourage you to join the fray.

Fourth, there is an argument that even if you cannot see how to do good, and even if you cannot speak openly about it, it may make sense to stay in power rather than choose to forgo it, even if you increase the risk of extinction or eternal totalitarian takeover. I am reluctant to endorse this strategy on principle, but my guess is that there are worlds dark enough where simply getting power is worthwhile. Hm, my actual guess is that the correct policy is to not do this in situations that risk extinction and takeover, and only to do it in situations with much lesser risks such as genocide or centuries-long totalitarian takeover, but I am unsure. I think that the hardest thing here is that this heuristic can cause a very sizeable fraction of the people working on the project to be people who spiritually oppose it — imagine if 50% of the Nazis were people broadly opposed to Nazism and genocide but who felt that it was better for there to be opposition inside the organization; this would obviously be the wrong call. I think this makes sense for a handful of very influential people to do, and you should execute an algorithm that prevents more than ~5% of the organization's resources from coming from those who politically oppose it. (There are also many other reasons relating to honesty and honor to not work at an organization whose existence you politically oppose.)

In summary, my position is that there are sometimes cases where the good and right thing to do is to work as part of an effort that is on a path to cause human extinction or an eternal totalitarian takeover. They are: if you believe you have a good plan for how to prevent it, if you believe you are able to expose the corruption, if you are able to substantially improve outsiders’ understanding of the bad outcomes being worked toward, and sometimes just so that anyone from a faction opposed to extinction and eternal totalitarian takeover has power.

Lawrence’s Opening Statement

LawrenceC

No, it’s not ethical to work at AI companies.

Here’s my take:

  1. First, from a practical utilitarian perspective, do we expect the direct marginal effect of someone joining an AI scaling lab to be positive? I argue the answer is pretty obviously no.

    1. The number of people at these companies is clearly already too high on the margin due to financial + social incentives, so it’s unlikely that the marginal joiner is positive.

    2. The direct impact of your work is almost certainly net negative. (Per person at labs, this is true on average.)

    3. Incentives (both financial and social) inhibit clear thinking.

      1. Many things that might be good become unthinkable (e.g. AI pauses)

  2. Second, are there any big categories of exceptions? I argue that there are very few exceptions to the above rule.

    1. Direct safety work is probably not that relevant and is likely to just be capabilities anyway; plus, you can probably do it outside of a lab.

      1. Financial and other incentives strongly incentivize you to do things of commercial interest; this is likely just capabilities work.

      2. Lots of safety work is ambiguous and easy to round one way or the other.

    2. Gaining power in hopes of using it later is not a reliable strategy; almost no one remains incorruptible.

      1. It’s always tempting to bide your time and wait for the next chance.

      2. Respectability is in large part a social fiction that binds people to not do things.

    3. Tripwires are pretty fake: very few people make such explicit commitments, and I think the rate of people sticking to them is also not very high.

    4. The main exception is for people who:

      1. Are pursuing research directions that require weight-level access to non-public models, OR whose research requires too much compute to pursue outside of a lab

      2. AND where the lab gives you the academic freedom to pursue the above research

      3. AND you’re highly disagreeable (so relatively immune to peer pressure) and immune to financial incentives.

    5. But the intersection of the above cases is basically null.

      1. There’s lots of research to be done and other ways to acquire resources for research (e.g. as a contractor or via an AISI/the US government).

      2. Academic freedom is very rare at labs unless you’re a senior professor/researcher that the labs are fond of.

      3. People who get the chance to work at labs are selected for being more agreeable, and also people greatly underestimate the effect of peer pressure and financial incentives on average.

  3. Third, I argue that for people in this position, it’s almost certainly the case that there are better options.

    1. There’s lots of independent AIS research to be done.

      1. Not to say that people should be independent AIS researchers – I agree that their output is pretty low, but I argue that it’s because independent research sucks, not because there isn’t work to be done.

    2. Empirically people’s research output drops after going to labs

      1. Anthropic interp seems bogged down due to having to work on large models + being all in on one particular agenda

      2. [A few researchers, names redacted]’s work is much less exciting now

      3. The skills you build on the margin don’t seem that relevant to safety either, as opposed to scaling/working with big models.

    3. There’s a ton of policy or other positions that people don’t take due to financial incentives, but people who care about doing the right thing can.

      1. e.g. red lines, independent evals

      2. Empirically, it seems difficult to hire for METR/other EA orgs, due in large part to financial incentives + lack of social support.

  4. I argue, from the deontological or coordination perspective, that the world would be better if people didn’t do things they thought were bad; that is, the means rarely if ever justify the ends.

    1. This is a pretty good heuristic imo, though a bit hard to justify on naive utilitarian grounds.

    2. But it seems likely that a reasonable rule utilitarianism would say not to do it.

Verbal Interrogation

We questioned each other for ~20 minutes, which informed the following discussion.

Ben’s First Rebuttal

Ben Pace

It seems that we’re arguing substantially about in-principle vs in-practice. Lawrence mostly wanted to ask if there were literally any people in any of the categories that I could point to, and argued that many people currently there trick themselves into thinking they’re in those categories.

I felt best about defending the second category.

I then also want to push back harder on Lawrence’s third argument, that “for people in this position, it’s almost certainly the case that there are better options.”

Can people have higher integrity & accountability standards than ~everyone else at these orgs?

I think that Lawrence and I disagree about how easy it would be for someone to give themselves way better incentives if they just tried to. My guess is that there are non-zero people who would consider joining a capabilities company, who would sign up to have a personal board of (say) me, Lawrence, and Daniel Kokotajlo to check in with quarterly, and who would agree to quit their job if 2 of the 3 of us fired them. They can also publicly list their red lines ahead of joining.

Perhaps if the organization got wind of this then they would not hire such a person. I think this would certainly be a sign that the org was not one you could work at ethically. I can also imagine the person doing this privately and not telling the organization, in a more adversarial setup (like an internal spy or corporate whistleblower). I’d have to think through the ethics of that; my current guess is that I should never support such an effort, but I’ve not thought about it much. (I’d be interested in reading or having a debate on the ethics of corporate spies.)

Lawrence disagrees that people would in-practice be open to this, or that if they did so then they would get hired.

My guess is that I’d like to try these ideas with some people who are considering getting hired, or perhaps one or two people I have in mind in particular.

Do people have better alternatives?

Lawrence writes above that, in general, the people going to these orgs have better alternatives elsewhere. I agree this is true for a few select individuals but I don’t think that this is the case for most people.

Let’s very roughly guess that there are 100 “safety” researchers and 100 policy researchers at capabilities companies. I do not believe there is a similarly sized independent ecosystem of organizations for them to join. For most of these people, I think their alternative is to go home (a phrase which here means “get a job at a different big tech company or trading company that pays 50%-100% of the salary they were previously getting”). I concur that I’d like to see more people go independent, but I think it’s genuinely hard to get funding, and there’s little mentorship and there are few existing orgs to fit into, so you need to work hard (in the style of a startup founder) to do this.

My sense is that it would be good for a funder like Open Philanthropy to flood the independent scene with funding so that they can hire competitively, though there are many hidden strings involved in that money already which may easily make that action net negative.

But I am like “if you have an opportunity to work at an AI org, and one of my four reasons holds, I think it is quite plausible that you have literally no other options for doing good, and the only default to consider is to live a quiet life”. I want to push back on the assumption that there are always lots of good options to directly improve things. I mostly do not want to flood a space with random vaguely well-intentioned people who haven’t thought about things much and do not have strong character or epistemologies or skills, and so I disagree that these people reliably have better options for helping reduce existential risk from AI.

Lawrence’s First Rebuttal

LawrenceC

After talking with Ben, it seems that we agree on the following points:

  1. It’s possible in principle that you could be in an epistemic situation where it’s correct to join a lab.

  2. It’s rare in practice for people to be in the situations described by Ben above. There are maybe <10 examples of people falling into any of categories 1–4, out of thousands who have joined a lab citing safety concerns.

  3. The financial and social incentives to join a lab are massive, ergo more people join than should just based on ethical/impact grounds, ergo you should be suspicious a priori when someone claims that they’re doing it on ethical grounds.

Our disagreements seem to be:

  1. Ben thinks it’s more plausible than I do that people can make tripwires work. This seems empirically testable, and we should test it.

  2. I seem to believe that there are better opportunities for people outside of labs than Ben does. My guess is a lot of this is because I’m more bullish on existing alignment research than Ben is. Maybe I also believe that social pressures are harder to fix than practical problems like raising money or having an org. It’s also plausible that I see a selected subgroup of people (e.g. those that METR/FAR want to hire, or the most promising researchers in the field), and that on average most people don’t have good options. Note that this doesn’t mean that Ben disagrees that fewer people should be at labs at the margin – he just thinks that these people should go home.

But I think these are pretty minor points.

I think it’s a bit hard to rebut Ben’s points when I agree with them in principle, but my contention is just that it’s rare for anyone to ever find themselves in such situations in practice, and that people who think they’re in such a situation tend to on average be wrong.

That being said, here are my quick responses to Ben’s main points:

First, I think few if any people who join labs have such a concrete mechanism in mind, and most do it in large part due to financial or status reasons. Second, I think the number of examples where someone has successfully mitigated damage in the AI capabilities context is really small; there’s a reason we keep talking about Daniel Kokotajlo. Third, there are zero examples of people doing this, in large part because you cannot make a $10B capabilities lab while saying this. Fourth, the majority of people who think they’re in this situation have been wrong. I’m sure many, many people joined the Nazis because they thought “better me in charge than a real Nazi”, and people definitely have thought this way in Washington DC in many contexts. Also, as Ben points out, there are many virtue ethical or deontological reasons to not pursue such a policy.

Perhaps my best argument is: what’s the real point of saying something is ethical? I don’t think we should focus on whether or not things are ethical in principle – this leads people down rabbit holes where they use contrived situations to justify violence against political opponents (“punch a Nazi”) or torture (the “ticking time bomb” terrorism scenarios). Ethics should inform what to do in practice, based primarily on the situations that people find themselves in, and not based on extreme edge cases that people are unlikely to ever find themselves in.

Ben’s Second Rebuttal

Ben Pace

Briefly responding to two points:

I’m sure many, many people joined the Nazis because they thought “better me in charge than a real Nazi”

I’m not sure I agree with this. My guess is that currently people are doing way more “I’m actually a good person so it’s good for me to be the person building the extinction machine” than people thought “I’m actually a good person so it’s good for me to be the person running the concentration camp”. I think the former is partly because there’s a whole community of people who identify as good (“effective altruists”) who have poured into the AI scene, and there wasn’t an analogous social group in the German political situation. I do think that there were some lone heroes in Nazi Germany (Martin Sustrik has written about them), but there was less incentive to be open about it, and there was a lot of risk involved. My guess is that on some dimensions I’d feel better about someone at a capabilities company who was attempting this sort of internal sabotage, in terms of them being open and honest with themselves. I don’t endorse it in the current adversarial and political environment; it seems like a crazy defection on US/UK society (I’d be interested to debate that topic with someone; in today’s society, is it credible that we’ll get to a point where having corporate spies is ethical?). But overall my guess is that staff in Nazi Germany didn’t think about it too much or self-deceived, rather than thinking of themselves as more ethical than the alternative. (“Wow, it sucks that I have to do this job, but I think it’s the best choice for me and my family, and so I will try to avoid thinking about it otherwise.”)

[After chatting, it does seem that Lawrence and I have a disagreement about how often people take these jobs with selfish internal accounts versus how often they give altruistic internal accounts.]

what’s the real point of saying something is ethical?

What’s the point of arguing ethics? I guess it’s good to know where the boundaries lie. I think it helps to be able to notice if you’re in a situation where you actually should work for one of these companies. And it helps to notice if you’re nowhere near such a situation.

Final Note

I think you did a good job of laying out where we agreed and disagreed. I am also curious to discuss tripwires & counter-incentives at more length sometime. I found this debate very helpful for clarifying the principles and the practice!

Lawrence cut his final rebuttal as he felt it mostly repeated prior discussion from above.