I would wire you guys 300-400K today if I wasn’t still worried about the theory that ‘AI Safety is actually a front for funding advancement of AI capabilities’. It is a quixotic task to figure out how true that theory is or what actually happened in the past, never mind why. But the theory seems at least kind of true to me and so I will not be donating.
It’s unlikely to be worth your time to try to convince me to donate. But maybe other potential donors would appreciate a reassurance that it’s not actively net-negative to donate. For example, several people mentioned in the post have ties to dangerous organizations such as Anthropic.
Meta-honesty: There is not enough values alignment to trust me with sensitive information and I definitely do not endorse ‘always keep secrets you agreed to keep’. I support leaking the Pentagon Papers, etc.
My own professional opinion, not speaking for any other grantmakers or giving an institutional view for LTFF etc:
Yeah I sure can’t convince you that donating to us is definitely net positive, because such a claim wouldn’t be true.
So basically I don’t think it’s possible to do robustly positive actions in longtermism with high (>70%? >60%?) probability of being net positive for the long-term future[1], and this number is even lower for people who don’t place the majority of their credence on near- to medium-term extinction risk timelines.
I don’t think this is just an abstract theoretical risk; as you mention, there’s a real risk that our projects are net negative, and advancing AI capabilities more than AI safety is the most obvious way that this could be true.
I think the other LTFF grantmakers and I are pretty conscious about downside risks in capabilities enhancements, though I expect there’s a range of opinions on the fund on how much to weigh that against other desiderata, as well as which specific projects have the highest capabilities externalities.
I would guess that we’re better about this than most (all?) other significant longtermist funders, including both organizations and individuals (though keep in mind that the average for individuals is driven by the long left tail). But since we’re optimizing for other things as well (most importantly positive impact), I think we’d do worse than you would on this axis if you a) have reasonably good judgment b) are laser-focused on preventing capabilities externalities, and c) have access to good donation options directly, especially by your own worldview. And of course reality doesn’t grade on a curve, so doing better than other funders isn’t a guarantee we’re doing well enough.
I don’t do many evaluations of alignment grants myself because others on the fund seem more technically qualified, so my time is usually triaged to looking at other projects (eg forecasting, biosecurity). But I do try to flag downside risks I see in LTFF grants overall, including in alignment grants. (So far, I think the rest of the fund is sensible about capabilities risks, and capabilities risks usually aren’t the type of thing that non-public information is super useful for, so possibly none of my flags were on capabilities; they were more about interpersonal harm or professional integrity.) When I did flag risks, I’ve found the rest of the fund to be sensible about them. You might find this recent post to be useful.
(On the flip side, there were a small number of grants that I liked that we were blocked from making for legal or PR reasons; for the most promising ones, one of us tried to connect the applicant to other funders)
If I were to hypothesize why LessWrongers should be worried about our capabilities externalities:
I think the average view in the fund (both unweighted and weighted by votes on alignment grants) is more optimistic on prosaic AI alignment strategies than what I perceive the median LessWrong view to be.
I expect that, under most worldviews, prosaic AI alignment has more capabilities externalities than other research agendas.
To be clear, I don’t think views in the fund are out of line with those of working AI safety researchers; I think the louder (and probably median?) voices on LessWrong are more negative on prosaic approaches.
Some of our grantees go on to work at AI labs like Anthropic or DeepMind, which many people here would consider to be bad for the world.
My own weakly-to-moderately held view is that doing AI Safety work at big labs is a good-to-great idea, but I don’t think the case is very robust, and reasonable people can and should disagree.
As you allude to, an important crux is whether/how much the work at the labs ends up being safety-washing.
I’m personally fairly against working at big labs in non-safety roles; the capabilities externalities just seem rather high, and the career capital argument seems a) not that strong compared to getting a random ML job at Google doing ads or working on collision detection at Tesla or something, and b) to rely implicitly on a certain willingness to defect for personal gain.
The moral mazes and institutional/cultural incentives to warp your beliefs seem pretty scary to me, but I don’t have a good solution.
We are not institutionally opposed to receiving money from employees at big labs
Though as an empirical matter I don’t think we’ve received much.
The ecosystem/memes/vibes near us have in fact resulted in a bunch of negative externalities before; there’s no guarantee we wouldn’t cause the same.
We haven’t tracked past negative externalities/negative impact grants very well, so I couldn’t eg point to our 10 worst grants ex post with an estimate of how bad they were (but we’re working on this).
We didn’t see the FTX crash coming.
I also think potential donors to us can just look at our past grants database, our payout report, or our marginal grants post to make an informed decision for themselves about whether donations to us are (sufficiently) net positive in expectation.
On a personal level: I don’t really know, man? I think the longtermist/rationalist EA memes/ecosystem were very likely causally responsible for some of the worst capabilities externalities in the last decade; I don’t have a sense of how bad it is overall because counterfactuals are really hard, but I don’t think it’s plausible that the negative impact was small. I’m pretty confused about whether people with thought processes like mine have been historically net positive or net negative; I can see a strong case either way. The whole thing had a pretty direct effect on me being depressed for most of this year (with the obvious caveat that etiology is hard for mental illness stuff, and being sad for cosmic reasons is one of the most self-flattering stories I could have for melancholy). Interestingly, I think the emotional effect is much larger than I would’ve ex ante predicted; if you had asked me in 2017 whether I thought longtermist work might be net negative, I don’t think my numbers would’ve been that different. I guess the specific details and concreteness did matter.
I have a lot of sympathy for people who decided to be a bit more checked out of morality, or decided to give up on this whole AI thing and focus on just reducing suffering in the next few decades (I think farmed animal welfare is the most popular candidate). But ultimately I think they’re wrong. The future is still going to be big, and likely really wild, and likely at least somewhat contingent. Knowing (or at least having a high probability) that people near us did a bunch of harmful stuff in the past is certainly an argument for being much more careful going forwards (as well as a number of more concrete and specific updates), but not really a good case to just roll over. (In the abstract, I do think it’s more plausible that for some people acting now is wrong compared to retreating to the woods for a year and thinking really hard; as an empirical matter when I did weaker versions of that, the effect was basically between useless and negative).
I think it’s a bit more feasible if you’re willing to make >3 OOMs sacrifice in expected positive impact. But still pretty rough. Some green energy stuff might be safe? Maybe try to convince doomsday preppers to be nicer people? I confess to not thinking much about it; I think some of the Oxford people might have a better idea.
I think the longtermist/rationalist EA memes/ecosystem were very likely causally responsible for some of the worst capabilities externalities in the last decade;
If you’re thinking of the work I’m thinking of, I think about zero of it came from people aiming at safety work and producing externalities, and instead about all of it was people in the community directly working on capabilities or capabilities-adjacent projects, with some justification or the other.
Yeah, most of the things I’m thinking of didn’t look like technical safety stuff, more like Demis and Shane being concerned about safety → decided to found DeepMind, Eliezer introducing Demis and Shane to Peter Thiel (their first funder), etc.
In terms of technical safety stuff, sign confusion around RLHF is probably the strongest candidate. I’m also a bit worried about capabilities externalities of Constitutional AI, for similar reasons. There’s also the general vibes issue of safety work (including quite technical work) and communications either making AI capabilities seem more cool* or seem less evil (depending on your framing).
EDIT to add: I feel like in Silicon Valley (and maybe elsewhere but I’m most familiar with Silicon Valley) there’s a certain vibe of coolness being more important than goodness, which feels childish to me but afaict seems like a real thing. This Altman tweet seems emblematic of that mindset.
I feel like in Silicon Valley (and maybe elsewhere but I’m most familiar with Silicon Valley) there’s a certain vibe of coolness being more important than goodness
Yeah, I definitely think this is true to some extent. “First get impact, then worry about the sign later” and all.
So basically I don’t think it’s possible to do robustly positive actions in longtermism with high (>70%? >60%?) probability of being net positive for the long-term future
This seems like an important point, and it’s one I’ve not heard before. (At least, not outside of cluelessness or specific concerns around AI safety speeding up capabilities; I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future.)
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little? For instance, is there a theoretical argument going on here, like a weak form of cluelessness? Or is it more empirical, for example, did you get here through evaluating a bunch of grants and noticing that even the best seem to carry 30-ish percent downside risk? Something else?
I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future).
Really? Without giving away names, can you tell me roughly what cluster they are in? Geographical area, age range, roughly what vocation (technical AI safety/AI policy/biosecurity/community building/earning-to-give)?
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little? For instance, is there a theoretical argument going on here, like a weak form of cluelessness? Or is it more empirical,
Definitely closer to the former than the latter! Here are some steps in my thought process:
The standard longtermist cluelessness arguments (“you can’t be sure if eg improving labor laws in India is good because it has uncertain effects on the population and happiness of people in Alpha Centauri in the year 4000”) don’t apply in full force if you buy high near-term (10-100 years) probability of AI doom, and that AI doom is astronomically bad and avoidable.
or (less commonly on LW but more common in some other EA circles) other sources of hinge of history like totalitarian lock-in, s-risks, biological tech doom, etc
If you assign low credence to any hinge of history hypothesis, I think you are still screwed by the standard cluelessness arguments, unfortunately.
But even with a belief in x-risk hinge of history, cluelessness still applies significantly. Knowing whether an action reduces x-risk is much easier in relative terms than knowing whether an action will improve the far future in the absence of x-risk, but it’s still hard in absolute terms.
If we drill down on a specific action and a specific theory of change (“I want to convince a specific Senator to sign a specific bill to regulate the size of LLM models trained in 2024”, “I want to do this type of technical research to understand this particular bug in this class of transformer models, because better understanding of this bug can differentially advance alignment over capabilities at Anthropic if Anthropic will scale up this type of model”), any particular action’s impact is just built on a tower of conjunctions and it’s really hard to get any grounding to seriously argue that it’s probably positive.
So how do you get any robustness? You imagine the set of all your actions as slightly positive bets/positively biased coin flips (eg a grantmaker might investigate 100+ grants in a year, something like deconfusion research might yield a number of different positive results, field-building for safety might cause a number of different positive outcomes, you can earn-to-give for multiple longtermist orgs, etc). If heads are “+1” and tails are “-1”, and you have a lot of flips, then the central limit theorem gets you a nice normal distribution with a positive mean and thin tails.
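To spell out the idealized claim in symbols (a sketch only; the symbols $p$, $n$, $X_i$ and $S_n$ are introduced here for illustration and assume independent, equal-sized flips):

```latex
% Assumption: X_1, ..., X_n are independent, X_i = +1 with probability p > 1/2, else -1.
\begin{align*}
S_n &= \sum_{i=1}^{n} X_i, \qquad
\mathbb{E}[S_n] = n(2p - 1), \qquad
\operatorname{Var}(S_n) = 4np(1 - p), \\
\Pr(S_n > 0) &\approx \Phi\!\left( \frac{(2p - 1)\sqrt{n}}{2\sqrt{p(1 - p)}} \right),
\end{align*}
% where \Phi is the standard normal CDF (central limit theorem approximation).
```

Under these assumptions the argument of $\Phi$ grows like $\sqrt{n}$, so the probability of ending up net positive tends to 1 as the number of flips grows. As a purely illustrative case, $p = 0.55$ and $n = 100$ give an argument of about 1, which corresponds to a probability somewhere in the low 0.8s.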
Unfortunately the real world is a lot less nice than this because:
A concrete example is that maybe a really unexpectedly bad grant can wipe out all of the positive impact your good grants have gotten, and then some.
the impact and theories of change of all your actions likely share a worldview and have internal correlations
eg, “longtermist EA fieldbuilding” has multiple theories of impact, but you can be wrong about a few important things and (almost) all of them might end up differentially advancing capabilities over alignment, in very correlated ways.
You might not have all that many flips that matter
The real world is finite, your life is finite, etc, so even if in the limit your approach is net positive, there’s no guarantee that in practice your actions are net positive before either you die or the singularity happens.
That doesn’t mean it’s wrong to dedicate your life to a single really important bet! (as long as you are obeying reasonable deontological and virtue ethics constraints, you’re trying your best to be reasonable, etc).
For people in those shoes, a possibly helpful mental motion is to try to think less of individual impact and more communally. Maybe it’s like voting: individual votes are ~useless but collectively people-who-think-like-you can hopefully vote for a good leader. If enough people-like-you follow an algorithm of “do unlikely-to-work research projects that are slightly positive in expectation”, collectively we can do something important.
probably a few other things I’m missing.
So the central modeling issues become a) how many flips you get, b) how likely all the flips are dominated by a single coin, c) how much internal correlation there is between each coin flip.
And my gut is like, it seems like you get a fair number of flips, it’s reasonably likely but not certain that one (or a few) flips dominate, and the internal correlation is high but not 1 (and not very close to 1).
There are a few more thoughts I have but that’s the general gist. Unfortunately it’s not very mathematical/quantitative or much of a model; my guess is that both more conceptual thinking and more precise models can yield some more clarity, but ultimately we (or at least I) will still end up fairly confused even after that.
I’m also interested in thoughts from other people here; I’m sure I’m not the only person who is worried about this type of thing.
(Also please don’t buy my exact probabilities. They are very much not resilient. Like I’m pretty sure if I thought about it for 10 years (without new empirical information) the probability can’t be much higher than 90%, and I’m pretty sure the probabilities are high enough to be non-Pascalian, so not as low as say 50% + 1-in-a-quadrillion, but anywhere in between seems kinda defensible).
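The coin-flip framing above can be turned into a small Monte Carlo sketch. This is only an illustration, not a model anyone in this thread has endorsed: the function names, the parameter values (55% heads, 100 flips, a 2% chance of a 50x-sized flip) and the way correlation is induced (each flip copies a single shared “worldview” draw with probability `shared_weight`) are all assumptions made up for the example.

```python
import random


def simulate_total(n_flips, p_heads, shared_weight, tail_prob, tail_size, rng):
    """Net outcome of one hypothetical portfolio of bets.

    Each flip is +size or -size. With probability `shared_weight` a flip
    copies one shared "worldview" draw (this induces correlation between
    flips); otherwise it is an independent biased coin. With probability
    `tail_prob` a flip has magnitude `tail_size` instead of 1.
    """
    worldview = 1 if rng.random() < p_heads else -1
    total = 0.0
    for _ in range(n_flips):
        if rng.random() < shared_weight:
            sign = worldview
        else:
            sign = 1 if rng.random() < p_heads else -1
        size = tail_size if rng.random() < tail_prob else 1.0
        total += sign * size
    return total


def prob_net_positive(n_flips, p_heads=0.55, shared_weight=0.0,
                      tail_prob=0.0, tail_size=1.0, trials=5000, seed=0):
    """Estimate P(total > 0) by simulating `trials` portfolios."""
    rng = random.Random(seed)  # fixed seed so runs are reproducible
    wins = 0
    for _ in range(trials):
        total = simulate_total(n_flips, p_heads, shared_weight,
                               tail_prob, tail_size, rng)
        if total > 0:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    scenarios = {
        "many independent equal flips": dict(n_flips=100),
        "few flips": dict(n_flips=5),
        "highly correlated flips": dict(n_flips=100, shared_weight=0.7),
        "a few flips can dominate": dict(n_flips=100, tail_prob=0.02,
                                         tail_size=50.0),
    }
    for name, params in scenarios.items():
        print(f"{name}: {prob_net_positive(**params):.2f}")
```

In this toy setup, the “many independent equal flips” case is the only one where the chance of ending up net positive is comfortably high; having few flips, strong shared-worldview correlation, or a handful of dominant flips each pull it back toward the single-flip probability. The exact numbers depend entirely on the made-up parameters, so only the direction of the effect should be read off.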
“I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future”
Fwiw, I think this is probably true for very few if any of the EAs I’ve worked with, though that’s a biased sample.
I wonder if the thing giving you this vibe might be that they actually think something like “I’m not that confident that my work is net positive for the LTF but my best guess is that it’s net positive in expectation. If what I’m doing is not positive, there’s no cheap way for me to figure it out, so I am confident (though not ~100%) that my work will keep seeming positive EV to me for the near future.” One informal way to describe this is that they are confident that their work is net positive in expectation/ex ante but not that it will be net positive ex post.
I think this can look a lot like somebody being ~sure that what they’re doing is net positive even if in fact they are pretty uncertain.
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little?
One way I think about this is there are just so many weird (positive and negative) feedback loops and indirect effects, so it’s really hard to know if any particular action is good or bad. Let’s say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:
• the research appears promising but turns out not to be, but in the meantime it wastes the time of other alignment researchers who otherwise would’ve gone into other areas
• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than counterfactually
• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research
• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)
• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net
• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk
• the research advances capabilities, either directly or indirectly
• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)
• Etcetera – I feel like I could do this all day.
Some of the above are more likely than others, but there are just so many different possible ways that any particular intervention could wind up being net negative (and also, by the same token, could alternatively have indirect positive effects that are similarly large and hard to predict).
Having said that, it seems to me that on the whole, we’re probably better off if we’re funding promising-seeming alignment research (for example), and grant applications should be evaluated within that context. On the specific question of safety-conscious work leading to faster capabilities gains, insofar as we view AI as a race between safety and capabilities, it seems to me that if we never advanced alignment research, capabilities would be almost sure to win the race, and while safety research might bring about misaligned AGI somewhat sooner than it otherwise would occur, I have a hard time seeing how it would predictably increase the chances of misaligned AGI eventually being created.
I’m not sure which of the people “have ties to dangerous organizations such as Anthropic” in the post (besides Shauna Kravec & Nova DasSarma, who work at Anthropic), but of the current fund managers, I suspect that I have the most direct ties to Anthropic and OAI through my work at ARC Evals. I also have done a plurality of grant evaluations in AI Safety in the last month. So I think I should respond to this comment with my thoughts.
I personally empathize significantly with the concerns raised by Linch and Oli. In fact, when I was debating joining Evals last November, my main reservations centered around direct capabilities externalities and safety washing.
I will say the following facts about AI Safety advancing capabilities:
Empirically, when we look at previous capability advancements produced by people working in the name of “AI Safety” from this community, the overwhelming majority were produced by people who were directly aiming to improve capabilities.
That is, they were not capability externalities from safety research, so much as direct capabilities work.
E.g., it definitely was not the case that GPT-3 was a side effect of alignment research, and OAI and Anthropic are both orgs that explicitly focus on scaling and keeping at the frontier of AI development.
I think the sole exception is a few people who started doing applied RLHF research. Yeah, I think the people who made LLMs commercially viable did not do a good thing. My main uncertainty is what exactly happened here and how much we contribute to this on the margin.
I generally think that research is significantly more useful when it is targeted (this is a very common view in the community as well). I’m not sure what the exact multiplier is, but I think targeted, non-foundational research is probably 10x more effective than incidentally related research. So the net impact of safety research on capabilities via externalities is probably significantly smaller than the impact of safety research on safety research, or the impact of targeted capabilities research on capabilities research.
I think this point is often overstated or overrated, but the scale of capabilities researchers at this point is really big, and it’s easy to overestimate the impact of one or two particular high profile people.
For what it’s worth, I think that if we are to actually produce good independent alignment research, we need to fund it, and LTFF is basically the only funder in this space. My current guess is a lack of LTFF funding is probably producing more researchers at Anthropic than otherwise, because there just aren’t that many opportunities for people to work on safety or safety-adjacent roles. E.g. I know of people who are interviewing for Anthropic capability teams because idk man, they just want a safety-adjacent job with a minimal amount of security, and it’s what’s available. Having spoken to a bunch of people, I strongly suspect that of the people that I’d want to fund but won’t be funded, at least a good fraction are significantly less likely to join a scaling lab if they were funded, and not more.
(Another possibly helpful datapoint here is that I received an offer from Anthropic last December, and I turned them down.)
My current guess is a lack of LTFF funding is probably producing more researchers at Anthropic than otherwise, because there just aren’t that many opportunities for people to work on safety or safety-adjacent roles. E.g. I know of people who are interviewing for Anthropic capability teams because idk man, they just want a safety-adjacent job with a minimal amount of security, and it’s what’s available. Having spoken to a bunch of people, I strongly suspect that of the people that I’d want to fund but won’t be funded, at least a good fraction are significantly less likely to join a scaling lab if they were funded, and not more.
I think this is true at the current margin, because we have such limited money. But if we receive, say, enough funding to lower the bar to closer to what our early 2023 bar was, I will still want to make skill-up grants to fairly talented/promising people, and I still think they are quite cost-effective. I do expect those grants to have more capabilities externalities (at least in terms of likelihood, maybe in expectation as well) than when we give grants to people who currently could be hired at (eg) Anthropic but choose not to.
It’s possible you (and maybe Oli?) disagree and think we should fund moderate-to-good direct work projects over all (or almost all) skillup grants; in that case this is a substantive disagreement about what we should do in the future.
E.g. I know of people who are interviewing for Anthropic capability teams because idk man, they just want a safety-adjacent job with a minimal amount of security, and it’s what’s available
That feels concerning. Are there any obvious things that would help with this situation, eg: better career planning and reflection resources for people in this situation, AI safety folks being more clear about what they see as the value/disvalue of working in those types of capability roles?
Seems weird for someone to explicitly want a “safety-adjacent” job unless there are weird social dynamics encouraging people to do that even when there isn’t positive impact to be had from such a job.
FWIW, I am also very worried about this and it feels pretty plausible to me. I don’t have any great reassurances, besides me thinking about this a lot and trying somewhat hard to counteract it in my own grant evaluations, but I only do a small minority of grant evaluations on the LTFF these days.
I do want to clarify that I think it’s unlikely that AI Safety is a front for advancing AI capabilities. I think the framing that’s more plausibly true is that AI Safety is a memespace that has undergone regulatory capture by capability companies and people in the EA network to primarily build out their own influence over the world.
Their worldview is of course heavily influenced by concerns about the future of humanity and how it will interact with AI, but in a way that primarily leverages symmetric weapons and does not involve much of any accountability or public reasoning about their risk models, which seem substantially skewed by the fact that people are making billions of dollars off of advances in AI capabilities, and are substantially worried that people they don’t like will get to control AI.
I do also think this is just one framing, and there are a lot of other things going on.
Have you looked at Orthogonal? They’re pretty damn culturally inoculated against doing-capabilities-(even-by-accident), and they’re extremely funding constrained.
I would wire you guys 300-400K today if I wasn’t still worried about the theory that ‘AI Safety is actually a front for funding advancement of AI capabilities’. It is a quixotic task to figure out how true that theory is or what actually happened in the past, neverminded why. But the theory seems at least kind of true to me and so I will not be donating.
Its unlikely to be worth your time to try to convince me to donate. But maybe other potential donors would appreciate a reassurance its not actively net-negative to donate. For example several people mentioned in the post have ties to dangerous organizations such as Anthropic.
Meta-honesty: There is not enough values alignment to trust me with sensitive information and I definitely do not endorse ‘always keep secrets you agreed to keep’. I support leaking the pentagon papers, etc.
My own professional opinion, not speaking for any other grantmakers or giving an institutional view for LTFF etc:
Yeah I sure can’t convince you that donating to us is definitely net positive, because such a claim wouldn’t be true.
So basically I don’t think it’s possible to do robustly positive actions in longtermism with high (>70%? >60%?) probability of being net positive for the long-term future[1], and this number is even lower for people who don’t place the majority of their credence on near- to medium-term extinction risk timelines.
I don’t think this is just an abstract theoretical risk, as you mention there’s a real risk that our projects are net negative; and advancing more AI capabilities than AI safety is the most obvious way that this is true.
I think the other LTFF grantmakers and I are pretty conscious about downside risks in capabilities enhancements, though I expect there’s a range of opinions on the fund on how much to weigh that against other desiderata, as well as which specific projects have the highest capabilities externalities.
I would guess that we’re better about this than most (all?) other significant longtermist funders, including both organizations and individuals (though keep in mind that the average for individuals is driven by the long left tail). But since we’re optimizing for other things as well (most importantly positive impact), I think we’d do worse than you would on this axis if you a) have reasonably good judgment b) are laser-focused on preventing capabilities externalities, and c) have access to good donation options directly, especially by your own worldview. And of course reality doesn’t grade on a curve, so doing better than other funders isn’t a guarantee we’re doing well enough.
I don’t do much evaluations of alignment grants myself because others on the fund seem more technically qualified so my time is usually triaged to looking at other projects (eg forecasting, biosecurity). But I do try to flag downside risks I see in ltff grants overall, including in alignment grants. (So far, I think the rest of the fund is sensible about capabilities risks and capabilities risks usually aren’t the type of thing that non-public information is super useful for, so possibly none of my flags were on capabilities, more like interpersonal harm or professional integrity). When I did, I’ve found the rest of the fund to be sensible about them. You might find this recent post to be useful.
(On the flip side, there were a small number of grants that I liked that we were blocked from making for legal or PR reasons; for the most promising ones, one of us tried to connect the applicant to other funders)
If I were to hypothesize why LessWrongers should be worried about our capabilities externalities:
I think the average view in the fund (both unweighted and weighted by votes on alignment grants) is more optimistic on prosaic AI alignment strategies than what I perceive the median LessWrong view to be.
I expect under most worldviews, prosaic AI alignment to have more capabilities externalities than other research agendas
To be clear, I don’t think views in the fund to be out-of-line with working AI safety researchers; I think the louder (and probably median?) voices on LessWrong are more negative on prosaic approaches.
Some of our grantees go on to work at AI labs like Anthropic or DeepMind, which many people here would consider to be bad for the world.
My own weakly-to-moderately held view is that doing AI Safety work at big labs is a good-to-great idea, but don’t think the case is very robust and reasonable people can and should disagree.
As you allude, an important crux is whether/how much the work at the labs end up being safety-washing
I’m personally fairly against working at big labs in non-safety roles; the capabilities externalities just seem rather high, and the career capital argument seem a) both not that high compared to getting a random ML job at Google doing ads or working at collision detection at Tesla or something and b) to rely implicitly on a certain willingness to defect for personal gain.
The moral mazes and institutional/cultural incentives to warp your beliefs seem pretty scary to me, but I don’t have a good solution.
We are not institutionally opposed to receiving money from employees at big labs
Though as an empirical matter I don’t think we’ve received much.
The ecosystem/memes/vibes near us have in fact resulted in a bunch of negative externalities before; there’s no guarantee we wouldn’t cause the same.
We haven’t tracked past negative externalities/negative impact grants very well, so I couldn’t eg point to our 10 worst grants ex post with an estimate of how bad they were (but we’re working on this).
We didn’t see the FTX crash coming.
I also think potential donors to us can just look at our past grants database, our payout report, or our marginal grants post to make an informed decision for themselves about whether donations to us are (sufficiently) net positive in expectation.
On a personal level:
I don’t really know, man? I think the longtermist/rationalist EA memes/ecosystem were very likely causally responsible for some of the worst capabilities externalities in the last decade; I don’t have a sense of how bad it is overall because counterfactuals are really hard but I don’t think it’s plausible that the negative impact was small. I’m pretty confused about whether people with thought processes like mine have been historically net positive or net negative; I can see a strong case either way. The whole thing had a pretty direct effect on me being depressed for most of this year (with the obvious caveat that etiology is hard for mental illness stuff, and being sad for cosmic reasons is one of the most self-flattering stories I could have for melancholy). Interestingly, I think the emotional effect is much larger than I would’ve ex ante predicted; if you had asked me in 2017 whether I thought longtermist work might be net negative, I don’t think my numbers would’ve been that different. I guess the specific details and concreteness did matter.
I have a lot of sympathy for people who decided to be a bit more checked out of morality, or decided to give up on this whole AI thing and focus on just reducing suffering in the next few decades (I think farmed animal welfare is the most popular candidate). But ultimately I think they’re wrong. The future is still going to be big, and likely really wild, and likely at least somewhat contingent. Knowing (or at least having a high probability) that people near us did a bunch of harmful stuff in the past is certainly an argument for being much more careful going forwards (as well as a number of more concrete and specific updates), but not really a good case to just roll over. (In the abstract, I do think it’s more plausible that for some people acting now is wrong compared to retreating to the woods for a year and thinking really hard; as an empirical matter when I did weaker versions of that, the effect was basically between useless and negative).
I think it’s a bit more feasible if you’re willing to make >3 OOMs sacrifice in expected positive impact. But still pretty rough. Some green energy stuff might be safe? Maybe try to convince doomsday preppers to be nicer people? I confess to not thinking much about it; I think some of the Oxford people might have a better idea.
I truly, truly appreciate reading this.
If you’re thinking of the work I’m thinking of, I think about zero of it came from people aiming at safety work and producing externalities, and instead about all of it was people in the community directly working on capabilities or capabilities-adjacent projects, with some justification or the other.
(personal opinions)
Yeah most of the things I’m thinking of didn’t look like technical safety stuff, more like Demis and Shane being concerned about safety → decided to found DeepMind, Eliezer introducing Demis and Shane to Peter Thiel (their first funder), etc.
In terms of technical safety stuff, sign confusion around RLHF is probably the strongest candidate. I’m also a bit worried about capabilities externalities of Constitutional AI, for similar reasons. There’s also the general vibes issue of safety work (including quite technical work) and communications either making AI capabilities seem more cool* or seem less evil (depending on your framing).
EDIT to add: I feel like in Silicon Valley (and maybe elsewhere but I’m most familiar with Silicon Valley) there’s a certain vibe of coolness being more important than goodness, which feels childish to me but afaict seems like a real thing. This Altman tweet seems emblematic of that mindset.
Yeah, I definitely think this is true to some extent. “First get impact, then worry about the sign later” and all.
This seems like an important point, and it’s one I’ve not heard before. (At least, not outside of cluelessness or specific concerns around AI safety speeding up capabilities; I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future.)
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little? For instance, is there a theoretical argument going on here, like a weak form of cluelessness? Or is it more empirical, for example, did you get here through evaluating a bunch of grants and noticing that even the best seem to carry 30-ish percent downside risk? Something else?
Really? Without giving away names, can you tell me roughly what cluster they are in? Geographical area, age range, roughly what vocation (technical AI safety/AI policy/biosecurity/community building/earning-to-give)?
Definitely closer to the former than the latter! Here are some steps in my thought process:
The standard longtermist cluelessness arguments (“you can’t be sure if eg improving labor laws in India is good because it has uncertain effects on the population and happiness of people in Alpha Centauri in the year 4000”) don’t apply in full force if you buy high near-term (10-100 years) probability of AI doom, and that AI doom is astronomically bad and avoidable.
or (less commonly on LW but more common in some other EA circles) other sources of hinge of history like totalitarian lock-in, s-risks, biological tech doom, etc
If you assign low credence in any hinge of history hypothesis, I think you are still screwed by the standard cluelessness arguments, unfortunately.
But even with a belief in x-risk hinge of history, cluelessness still applies significantly. Knowing whether an action reduces x-risk is much easier in relative terms than knowing whether an action will improve the far future in the absence of x-risk, but it’s still hard in absolute terms.
If we drill down on a specific action and a specific theory of change (“I want to convince a specific Senator to sign a specific bill to regulate the size of LLM models trained in 2024”, “I want to do this type of technical research to understand this particular bug in this class of transformer models, because better understanding of this bug can differentially advance alignment over capabilities at Anthropic if Anthropic will scale up this type of model”), any particular action’s impact is just built on a tower of conjunctions and it’s really hard to get any grounding to seriously argue that it’s probably positive.
So how do you get any robustness? You imagine the set of all your actions as slightly positive bets/positively biased coin flips (eg a grantmaker might investigate 100+ grants in a year, something like deconfusion research might yield a number of different positive results, field-building for safety might cause a number of different positive outcomes, you can earn-to-give for multiple longtermist orgs, etc). If heads are “+1” and tails are “-1”, and you have a lot of flips, then the central limit theorem gets you a nice normal distribution with a positive mean and thin tails.
Unfortunately the real world is a lot less nice than this because:
the impacts of your different actions are heavy-tailed, likely in both directions.
A concrete example is that maybe a really unexpectedly bad grant can wipe out all of the positive impact your good grants have gotten, and then some.
the impact and theories of change of all your actions likely share a worldview and have internal correlations
eg, “longtermist EA fieldbuilding” has multiple theories of impact, but you can be wrong about a few important things and e.g. (almost) all of them might end up differentially advancing capabilities over alignment, in very correlated ways.
You might not have all that many flips that matter
The real world is finite, your life is finite, etc, so even if in the limit your approach is net positive, there’s no guarantee that in practice your actions are net positive before either you die or the singularity happens.
That doesn’t mean it’s wrong to dedicate your life to a single really important bet! (as long as you are obeying reasonable deontological and virtue ethics constraints, you’re trying your best to be reasonable, etc).
For people in those shoes, a possibly helpful mental motion is to try to think less of individual impact and more communally. Maybe it’s like voting: individual votes are ~useless but collectively people-who-think-like-you can hopefully vote for a good leader. If enough people-like-you follow an algorithm of “do unlikely-to-work research projects that are slightly positive in expectation”, collectively we can do something important.
probably a few other things I’m missing.
So the central modeling issues become a) how many flips you get, b) how likely all the flips are dominated by a single coin, c) how much internal correlation there is between each coin flip.
And my gut is like, it seems like you get a fair number of flips, it’s reasonably likely but not certain that one (or a few) flips dominate, and the internal correlation is high but not 1 (and not very close to 1).
There are a few more thoughts I have but that’s the general gist. Unfortunately it’s not very mathematical/quantitative or much of a model; my guess is that both more conceptual thinking and more precise models can yield some more clarity, but ultimately we (or at least I) will still end up fairly confused even after that.
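A toy simulation can make the three modeling knobs above (number of flips, dominance by a single flip, correlation between flips) a bit more concrete. This is only a sketch under made-up assumptions: the function name `prob_net_positive`, the 0.55 chance of heads, the "copy a shared worldview draw" way of modeling correlation, and the Pareto-distributed bet sizes are all illustrative choices, not anything stated in the thread.

```python
import random


def prob_net_positive(n_flips, p_heads=0.55, shared_weight=0.0,
                      tail_index=None, trials=4000, seed=0):
    """Monte Carlo estimate of P(total impact > 0) over n_flips bets.

    All parameters are illustrative assumptions, not numbers from the thread:
      p_heads       chance that any single bet turns out positive
      shared_weight chance that a bet simply copies one shared "worldview"
                    draw (0 = independent bets, 1 = perfectly correlated)
      tail_index    Pareto shape for bet sizes (None = every bet is +/-1;
                    smaller values = heavier tails)
    """
    rng = random.Random(seed)
    positive_trials = 0
    for _ in range(trials):
        # One draw per trial stands in for "is the shared worldview right?"
        worldview_right = rng.random() < p_heads
        total = 0.0
        for _ in range(n_flips):
            if rng.random() < shared_weight:
                heads = worldview_right  # bet inherits the shared outcome
            else:
                heads = rng.random() < p_heads  # bet is an independent flip
            size = rng.paretovariate(tail_index) if tail_index else 1.0
            total += size if heads else -size
        if total > 0:
            positive_trials += 1
    return positive_trials / trials


if __name__ == "__main__":
    print("independent, thin-tailed:", prob_net_positive(100))
    print("correlated worldview:    ", prob_net_positive(100, shared_weight=0.8))
    print("heavy-tailed bet sizes:  ", prob_net_positive(100, tail_index=0.7))
```

Under these assumed settings, the independent thin-tailed case tends to come out net positive in a clear majority of trials, while either strong correlation or heavy-tailed bet sizes tends to pull the estimate back toward the single-flip probability, which is the worry described above.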
I’m also interested in thoughts from other people here; I’m sure I’m not the only person who is worried about this type of thing.
(Also please don’t buy my exact probabilities. They are very much not resilient. Like I’m pretty sure if I thought about it for 10 years (without new empirical information) the probability can’t be much higher than 90%, and I’m pretty sure the probabilities are high enough to be non-Pascalian, so not as low as say 50% + 1-in-a-quadrillion, but anywhere in between seems kinda defensible).
“I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future”
Fwiw, I think this is probably true for very few if any of the EAs I’ve worked with, though that’s a biased sample.
I wonder if the thing giving you this vibe might be that they actually think something like “I’m not that confident that my work is net positive for the LTF but my best guess is that it’s net positive in expectation. If what I’m doing is not positive, there’s no cheap way for me to figure it out, so I am confident (though not ~100%) that my work will keep seeming positive EV to me for the near future.” One informal way to describe this is that they are confident that their work is net positive in expectation/ex ante but not that it will be net positive ex post.
I think this can look a lot like somebody being ~sure that what they’re doing is net positive even if in fact they are pretty uncertain.
One way I think about this is there are just so many weird (positive and negative) feedback loops and indirect effects, so it’s really hard to know if any particular action is good or bad. Let’s say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:
• the research appears promising but turns out not to be, but in the meantime it wastes the time of other alignment researchers who otherwise would’ve gone into other areas
• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than counterfactually
• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research
• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)
• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net
• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk
• the research advances capabilities, either directly or indirectly
• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)
• Etcetera – I feel like I could do this all day.
Some of the above are more likely than others, but there are just so many different possible ways that any particular intervention could wind up being net negative (and also, by the same token, could alternatively have indirect positive effects that are similarly large and hard to predict).
Having said that, it seems to me that on the whole, we’re probably better off if we’re funding promising-seeming alignment research (for example), and grant applications should be evaluated within that context. On the specific question of safety-conscious work leading to faster capabilities gains, insofar as we view AI as a race between safety and capabilities, it seems to me that if we never advanced alignment research, capabilities would be almost sure to win the race, and while safety research might bring about misaligned AGI somewhat sooner than it otherwise would occur, I have a hard time seeing how it would predictably increase the chances of misaligned AGI eventually being created.
I’m not sure which of the people “have ties to dangerous organizations such as Anthropic” in the post (besides Shauna Kravec & Nova DasSarma, who work at Anthropic), but of the current fund managers, I suspect that I have the most direct ties to Anthropic and OAI through my work at ARC Evals. I also have done a plurality of grant evaluations in AI Safety in the last month. So I think I should respond to this comment with my thoughts.
I personally empathize significantly with the concerns raised by Linch and Oli. In fact, when I was debating joining Evals last November, my main reservations centered around direct capabilities externalities and safety washing.
I will say the following facts about AI Safety advancing capabilities:
Empirically, when we look at previous capability advancements produced by people working in the name of “AI Safety” from this community, the overwhelming majority were produced by people who were directly aiming to improve capabilities.
That is, they were not capability externalities from safety research, so much as direct capabilities work.
E.g, it definitely was not the case that GPT-3 was a side effect of alignment research, and OAI and Anthropic are both orgs who explicitly focus on scaling and keeping at the frontier of AI development.
I think the sole exception is a few people who started doing applied RLHF research. Yeah, I think the people who made LLMs commercially viable via RLHF did not do a good thing. My main uncertainty is what exactly happened here and how much we contribute to this on the margin.
I generally think that research is significantly more useful when it is targeted (this is a very common view in the community as well). I’m not sure what the exact multiplier is, but I think targeted, non-foundational research is probably 10x more effective than incidentally related research. So the net impact of safety research on capabilities via externalities is probably significantly smaller than the impact of safety research on safety research, or the impact of targeted capabilities research on capabilities research.
I think this point is often overstated or overrated, but the scale of capabilities researchers at this point is really big, and it’s easy to overestimate the impact of one or two particular high profile people.
For what it’s worth, I think that if we are to actually produce good independent alignment research, we need to fund it, and LTFF is basically the only funder in this space. My current guess is a lack of LTFF funding is probably producing more researchers at Anthropic than otherwise, because there just aren’t that many opportunities for people to work on safety or safety-adjacent roles. E.g. I know of people who are interviewing for Anthropic capability teams because idk man, they just want a safety-adjacent job with a minimal amount of security, and it’s what’s available. Having spoken to a bunch of people, I strongly suspect that of the people that I’d want to fund but won’t be funded, at least a good fraction would be significantly less likely to join a scaling lab if they were funded, and not more.
(Another possibly helpful datapoint here is that I received an offer from Anthropic last December, and I turned them down.)
I think this is true at the current margin, because we have so little money. But if we receive, say, enough funding to lower the bar to closer to what our early 2023 bar was, I will still want to make skill-up grants to fairly talented/promising people, and I still think they are quite cost-effective. I do expect those grants to have more capabilities externalities (at least in terms of likelihood, maybe in expectation as well) than when we give grants to people who currently could be hired at (eg) Anthropic but choose not to.
It’s possible you (and maybe Oli?) disagree and think we should fund moderate-to-good direct work projects over all (or almost all) skillup grants; in that case this is a substantive disagreement about what we should do in the future.
That feels concerning. Are there any obvious things that would help with this situation, eg: better career planning and reflection resources for people in this situation, AI safety folks being more clear about what they see as the value/disvalue of working in those types of capability roles?
Seems weird for someone to explicitly want a “safety-adjacent” job unless there are weird social dynamics encouraging people to do that even when there isn’t positive impact to be had from such a job.
FWIW, I am also very worried about this and it feels pretty plausible to me. I don’t have any great reassurances, besides me thinking about this a lot and trying somewhat hard to counteract it in my own grant evaluations, but I only do a small minority of grant evaluations on the LTFF these days.
I do want to clarify that I think it’s unlikely that AI Safety is a front for advancing AI capabilities. I think the framing that’s more plausibly true is that AI Safety is a memespace that has undergone regulatory capture by capability companies and people in the EA network to primarily build out their own influence over the world.
Their worldview is of course heavily influenced by concerns about the future of humanity and how it will interact with AI, but in a way that primarily leverages symmetric weapons and does not involve much of any accountability or public reasoning about their risk models, which seem substantially skewed by the fact that people are making billions of dollars off of advances in AI capabilities, and are substantially worried that people they don’t like will get to control AI.
I do also think this is just one framing, and there are a lot of other things going on.
Have you looked at Orthogonal? They’re pretty damn culturally inoculated against doing-capabilities-(even-by-accident), and they’re extremely funding constrained.