So basically I don’t think it’s possible to take robustly positive actions in longtermism, i.e. actions with a high (>70%? >60%?) probability of being net positive for the long-term future.
This seems like an important point, and it’s one I’ve not heard before. (At least, not outside of cluelessness or specific concerns around AI safety speeding up capabilities; I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future.)
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little? For instance, is there a theoretical argument going on here, like a weak form of cluelessness? Or is it more empirical, for example, did you get here through evaluating a bunch of grants and noticing that even the best seem to carry 30-ish percent downside risk? Something else?
I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future.
Really? Without giving away names, can you tell me roughly what cluster they are in? Geographical area, age range, roughly what vocation (technical AI safety/AI policy/biosecurity/community building/earning-to-give)?
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little? For instance, is there a theoretical argument going on here, like a weak form of cluelessness? Or is it more empirical,
Definitely closer to the former than the latter! Here are some steps in my thought process:
The standard longtermist cluelessness arguments (“you can’t be sure if e.g. improving labor laws in India is good because it has uncertain effects on the population and happiness of people in Alpha Centauri in the year 4000”) don’t apply in full force if you buy a high near-term (10-100 years) probability of AI doom, and that AI doom is astronomically bad and avoidable.
or (less commonly on LW, but more common in some other EA circles) other hinge-of-history scenarios like totalitarian lock-in, s-risks, biological tech doom, etc.
If you assign low credence to every hinge-of-history hypothesis, I think you are still screwed by the standard cluelessness arguments, unfortunately.
But even with a belief in an x-risk hinge of history, cluelessness still applies significantly. Knowing whether an action reduces x-risk is much easier in relative terms than knowing whether an action will improve the far future in the absence of x-risk, but it’s still hard in absolute terms.
If we drill down on a specific action and a specific theory of change (“I want to convince a specific Senator to support a specific bill to regulate the size of LLM models trained in 2024”, “I want to do this type of technical research to understand this particular bug in this class of transformer models, because better understanding of this bug can differentially advance alignment over capabilities at Anthropic, if Anthropic scales up this type of model”), any particular action’s impact is built on a tower of conjunctions, and it’s really hard to get enough grounding to seriously argue that it’s probably positive.
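To make the “tower of conjunctions” point concrete with made-up numbers: if a theory of change needs, say, six roughly independent steps to go right and you’d give each step 70% odds, the conjunction is already down to 0.7^6 ≈ 12%, and that’s before asking whether the end result is even positive conditional on everything going “right”.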
So how do you get any robustness? You imagine the set of all your actions as slightly positive bets/positively biased coin flips (e.g. a grantmaker might investigate 100+ grants in a year, something like deconfusion research might yield a number of different positive results, field-building for safety might cause a number of different positive outcomes, you can earn-to-give for multiple longtermist orgs, etc). If heads are “+1” and tails are “-1”, and you have a lot of flips, then the central limit theorem gets you a nice normal distribution with a positive mean and thin tails.
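Here’s a minimal sketch of that idealized picture (my own toy numbers, nothing rigorous): independent ±1 flips with a slight positive bias, where the chance that your total impact is positive climbs toward 1 as the number of flips grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_net_positive(n_flips, p_heads=0.55, n_trials=100_000):
    """Idealized model: n independent +1/-1 flips, each slightly biased positive.
    Returns the fraction of simulated 'careers' whose total impact is > 0."""
    flips = rng.choice([1, -1], size=(n_trials, n_flips), p=[p_heads, 1 - p_heads])
    return (flips.sum(axis=1) > 0).mean()

for n in [1, 101, 1001]:
    print(n, p_net_positive(n))
# With a 55% chance of "+1" per flip, P(total impact > 0) is roughly
# 0.55, 0.84, and 0.999 -- the "many slightly positive bets add up robustly" intuition.
```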
Unfortunately the real world is a lot less nice than this because:
the impact of your different actions is heavy-tailed, likely in both directions.
A concrete example is that a single really unexpectedly bad grant can wipe out all of the positive impact your good grants have achieved, and then some.
the impact and theories of change of all your actions likely share a worldview and so have internal correlations
e.g., “longtermist EA field-building” has multiple theories of impact, but you can be wrong about a few important things and (almost) all of them might end up differentially advancing capabilities over alignment, in very correlated ways.
You might not have all that many flips that matter
The real world is finite, your life is finite, etc., so even if in the limit your approach is net positive, there’s no guarantee that in practice your actions are net positive before either you die or the singularity happens.
That doesn’t mean it’s wrong to dedicate your life to a single really important bet! (As long as you are obeying reasonable deontological and virtue-ethics constraints, you’re trying your best to be reasonable, etc.)
For people in those shoes, a possibly helpful mental motion is to think less about individual impact and more communally. Maybe it’s like voting: individual votes are ~useless, but collectively people-who-think-like-you can hopefully vote for a good leader. If enough people-like-you follow an algorithm of “do unlikely-to-work research projects that are slightly positive in expectation”, collectively we can do something important.
probably a few other things I’m missing.
So the central modeling issues become: a) how many flips you get, b) how likely it is that all the flips are dominated by a single coin, and c) how much internal correlation there is between the coin flips.
And my gut says: it seems like you get a fair number of flips, it’s reasonably likely but not certain that one (or a few) flips dominate, and the internal correlation is high but not 1 (and not very close to 1).
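If it helps, here’s a quick toy simulation of those three knobs (all parameters made up by me, purely to gesture at the shape of the problem rather than to model anything real):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_net_positive(n_flips=30, worldview_weight=0.5, tail_df=2.5,
                   drift=0.2, n_trials=100_000):
    """Toy model of the three modeling issues above (all numbers are made up):
    - n_flips: how many bets you actually get (a),
    - worldview_weight: how strongly one shared factor (your worldview being
      right or wrong) drives every bet, i.e. the internal correlation (b, c),
    - tail_df: degrees of freedom of a Student-t; low values = heavy tails,
      where a single flip can dominate everything.
    Each bet = small positive drift + shared worldview term + heavy-tailed noise.
    Returns the fraction of simulated 'careers' whose total impact is > 0."""
    worldview = rng.standard_normal((n_trials, 1))                 # shared across all bets
    idiosyncratic = rng.standard_t(tail_df, (n_trials, n_flips))   # heavy-tailed, per bet
    impact = drift + worldview_weight * worldview + idiosyncratic
    return (impact.sum(axis=1) > 0).mean()

# Nice world: many flips, thin tails, no shared worldview -> robustly positive (~0.97).
print(p_net_positive(n_flips=100, worldview_weight=0.0, tail_df=50))
# Less nice world: fewer flips, heavy tails, strong shared worldview
# -> much closer to a coin flip (~0.55-0.6), even though every bet is positive-EV ex ante.
print(p_net_positive(n_flips=30, worldview_weight=1.0, tail_df=2.0))
```

Obviously the specific numbers don’t mean anything, but the qualitative point survives: the fewer, heavier-tailed, and more correlated your bets, the less the central limit theorem bails you out.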
There are a few more thoughts I have, but that’s the general gist. Unfortunately it’s not very mathematical/quantitative or much of a model; my guess is that both more conceptual thinking and more precise models can yield some more clarity, but ultimately we (or at least I) will still end up fairly confused even after that.
I’m also interested in thoughts from other people here; I’m sure I’m not the only person who is worried about this type of thing.
(Also, please don’t buy my exact probabilities. They are very much not resilient. I’m pretty sure that if I thought about it for 10 years (without new empirical information) the probability couldn’t end up much higher than 90%, and I’m pretty sure the probabilities are high enough to be non-Pascalian, so not as low as, say, 50% + 1-in-a-quadrillion, but anywhere in between seems kinda defensible.)
“I’m pretty sure that most EAs I know have ~100% confidence that what they’re doing is net positive for the long-term future”
Fwiw, I think this is probably true for very few if any of the EAs I’ve worked with, though that’s a biased sample.
I wonder if the thing giving you this vibe might be that they actually think something like “I’m not that confident that my work is net positive for the LTF, but my best guess is that it’s net positive in expectation. If what I’m doing is not positive, there’s no cheap way for me to figure it out, so I am confident (though not ~100%) that my work will keep seeming positive-EV to me for the near future.” One informal way to describe this is that they are confident that their work is net positive in expectation/ex ante, but not that it will be net positive ex post.
I think this can look a lot like somebody being ~sure that what they’re doing is net positive even if in fact they are pretty uncertain.
I’m super interested in how you might have arrived at this belief: would you be able to elaborate a little?
One way I think about this is that there are just so many weird (positive and negative) feedback loops and indirect effects that it’s really hard to know if any particular action is good or bad. Let’s say you fund a promising-seeming area of alignment research – just off the top of my head, here are several ways that grant could backfire:
• the research appears promising but turns out not to be, and in the meantime it wastes the time of other alignment researchers who otherwise would’ve gone into other areas
• the research area is promising in general, but the particular framing used by the researcher you funded is confusing, and that leads to slower progress than there would have been counterfactually
• the researcher you funded (unbeknownst to you) turns out to be toxic or otherwise have bad judgment, and by funding him, you counterfactually poison the well on this line of research
• the area you fund sees progress and grows, which counterfactually sucks up lots of longtermist money that otherwise would have been invested and had greater effect (say, during crunch time)
• the research is somewhat safety-enhancing, to the point that labs (facing safety-capabilities tradeoffs) decide to push capabilities further than they otherwise would, and safety is hurt on net
• the research is somewhat safety-enhancing, to the point that it prevents a warning shot, and that warning shot would have been the spark that would have inspired humanity to get its game together regarding combatting AI X-risk
• the research advances capabilities, either directly or indirectly
• the research is exciting and draws the attention of other researchers into the field, but one of those researchers happens to have a huge, tail negative effect on the field outweighing all the other benefits (say, that particular researcher has a very extreme version of one of the above bullet points)
• Etcetera – I feel like I could do this all day.
Some of the above are more likely than others, but there are just so many different possible ways that any particular intervention could wind up being net negative (and also, by the same token, could alternatively have indirect positive effects that are similarly large and hard to predict).
Having said that, it seems to me that on the whole we’re probably better off funding promising-seeming alignment research (for example), and grant applications should be evaluated within that context. On the specific question of safety-conscious work leading to faster capabilities gains: insofar as we view AI as a race between safety and capabilities, it seems to me that if we never advanced alignment research, capabilities would be almost sure to win the race. And while safety research might bring about misaligned AGI somewhat sooner than it otherwise would occur, I have a hard time seeing how it would predictably increase the chances of misaligned AGI eventually being created.