tl;dr: I think Anthropic is on track to trade off nontrivial P(win) to improve short-term AI welfare,[1] and this seems bad and confusing to me. (This worry isn’t really based on this post; the post just inspired me to write something.)
Anthropic buys carbon offsets to be carbon-neutral. Carbon-offset mindset involves:
Doing things that feel good—and look good to many ethics-minded observers—but are motivated more by purity than by seeking to do as much good as possible, and thus likely to be much less valuable than the best way to do good (on the margin)
Focusing on avoiding doing harm yourself, rather than focusing on net good or noticing how your actions affect others[2] (related concept: neglecting inaction risk)
I’m worried that Anthropic will be in carbon-offset mindset with respect to AI welfare.
There are several stories you can tell about how working on AI welfare soon will be a big deal for the long-term future (like, worth >>10^60 happy human lives):
If we’re accidentally torturing AI systems, they’re more likely to take catastrophic actions. We should try to verify that AIs are ok with their situation and take it seriously if not.[3]
It would improve safety if we were able to pay or trade with near-future potentially-misaligned AI, but we’re not currently able to, likely in part because we don’t understand AI-welfare-adjacent stuff well enough.
[Decision theory mumble mumble.]
Also just “shaping norms in a way that leads to lasting changes in how humanity chooses to treat digital minds in the very long run,” somehow.
[More.]
But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives. Numbers aside, the focus is on “we should avoid causing a moral catastrophe in our own deployments” and on merely Earth-scale stuff, not on “we should increase the chance that long-term AI welfare and the cosmic endowment go well.” Likewise, this post suggests efforts to “protect any interests that warrant protecting” and “make interventions and concessions for model welfare” at ASL-4. I’m very glad that this post mentions that doing so could be too costly, but I think very few resources (that trade off with improving safety) should go into improving short-term AI welfare (unless you’re actually trying to improve the long-term future somehow), and most people (including most of the Anthropic people I’ve heard from) aren’t thinking through the tradeoff. Shut up and multiply; treat the higher-stakes thing as proportionately more important.[4] (And notice inaction risk.)
(Plucking low-hanging fruit for short-term AI welfare is fine as long as it isn’t too costly and doesn’t crowd out more important AI welfare work.)
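To make the shut-up-and-multiply point concrete, here is a minimal back-of-the-envelope sketch. All numbers are illustrative placeholders (the stakes figures echo the 10^[6 or 9] and >>10^60 figures above; the P(win) delta is purely hypothetical), not estimates anyone has endorsed:

```python
# Back-of-the-envelope scope-sensitivity check. All numbers are illustrative
# placeholders, not claims from the post or estimates anyone has endorsed.

short_term_stakes = 1e9          # hypothetical: suffering-filled human-life-equivalents in a bad deployment
p_short_term_catastrophe = 0.05  # hypothetical: risk that a deployment causes that suffering

long_term_stakes = 1e60          # hypothetical: happy human lives at stake in the long-term future
delta_p_win = 1e-6               # hypothetical: change in P(win) from reallocating the same effort

ev_short_term = p_short_term_catastrophe * short_term_stakes  # expected short-term suffering averted
ev_long_term = delta_p_win * long_term_stakes                 # expected long-term value gained

print(f"short-term: {ev_short_term:.1e}")                 # ~5.0e+07
print(f"long-term:  {ev_long_term:.1e}")                  # ~1.0e+54
print(f"ratio:      {ev_long_term / ev_short_term:.1e}")  # ~2.0e+46
```

With any placeholders in this ballpark, the long-term term dominates by dozens of orders of magnitude, which is the sense in which short-term welfare is a rounding error except via its effects on the long-term future.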
I worry Anthropic is both missing an opportunity to do astronomical good in expectation via AI welfare work and setting itself up to sacrifice a lot for merely-Earth-scale AI welfare.
One might reply: Zach is worried about the long term, but Sam is just talking about decisions Anthropic will have to make in the short term; this is fine. To be clear, my worry is that Anthropic will be much too concerned with short-term AI welfare, and so it will make sacrifices (labor, money, interfering with deployments) for short-term AI welfare, and these sacrifices will make Anthropic substantially less competitive and slightly worse on safety, and this increases P(doom).
I wanted to make this point before reading this post; this post just inspired me to write it, despite not being a great example of the attitude I’m worried about, since it mentions how the costs of improving short-term AI welfare might be too great. (But it does spend two subsections on short-term AI welfare, which suggests that the author is much too concerned with short-term AI welfare [relative to other things you could invest effort into], according to me.)
I like and appreciate this post.
Or—worse—to avoid being the ones to cause short-term AI suffering.
E.g. failing to take seriously the possibility that you make yourself uncompetitive and your influence and market share just goes to less scrupulous companies.
Related to this and the following bullet: Ryan Greenblatt’s ideas.
For scope-sensitive consequentialists—at least—short-term AI welfare stuff is a rounding error and thus a red herring, except for its effects on the long-term future.
I agree that we generally shouldn’t trade off risk of permanent civilization-ending catastrophe for Earth-scale AI welfare, but I just really would defend the line that addressing short-term AI welfare is important for both long-term existential risk and long-term AI welfare. One reason why, which you don’t mention: AIs are extremely influenced by what they’ve seen other AIs in their training data do and how they’ve seen those AIs be treated—cf. some of Janus’s writing or Conditioning Predictive Models.
Sure, good point. But it’s far from obvious that the best interventions long-term-wise are the best short-term-wise, and I believe people are mostly just thinking about short-term stuff. I’d feel better if people talked about training data or whatever rather than just “protect any interests that warrant protecting” and “make interventions and concessions for model welfare.”
(As far as I remember, nobody’s published a list of how short-term AI welfare stuff can boost long-term AI welfare stuff that includes the training-data thing you mention. This shows that people aren’t thinking about long-term stuff. Actually, there hasn’t been much published on short-term stuff either, so: shrug.)
But when I hear Anthropic people (and most AI safety people) talk about AI welfare, the vibe is like it would be unacceptable to incur a [1% or 5% or so] risk of a deployment accidentally creating AI suffering worse than 10^[6 or 9 or so] suffering-filled human lives.
You have to be a pretty committed scope-sensitive consequentialist to disagree with this. What if they actually risked torturing 1M or 1B people? That seems terrible and unacceptable, and by assumption AI suffering is equivalent to human suffering. I think our societal norms are such that unacceptable things regularly become acceptable when the stakes are clear, so you may not even lose much utility from this emphasis on avoiding suffering.
It seems perfectly compatible with good decision-making that there are criteria A and B, A is much more important and therefore prioritized over B, and 2 out of 19 sections are focused on B. The real question is whether the organization’s leadership is able to make difficult tradeoffs, reassessing and questioning requirements as new information comes in. For example, in the 1944 Norwegian sabotage of a Nazi German heavy water shipment, stopping the Nazi nuclear program was the first priority. The mission went ahead with reasonable effort to minimize casualties, and 14 civilians died anyway, fewer than there could have been. It would not really have alarmed me to see a document discussing 19 efforts with 2 being avoidance of casualties, nor to know that the planners regularly talked with the vibe that 10-100 civilian casualties should be avoided, as long as someone had their eye on the ball.
Note that people who have a non-consequentialist aversion to risk of causing damage should have other problems with working for Anthropic. E.g. I suspect that Anthropic is responsible for more than a million deaths of currently-alive humans in expectation.
This is just the paralysis argument. (Maybe any sophisticated non-consequentialists will have to avoid this anyway. Maybe this shows that non-consequentialism is unappealing.)
[Edit after Buck’s reply: I think it’s weaker because most Anthropic employees aren’t causing the possible deaths, just participating in a process that might cause deaths.]
I think it’s a bit stronger than the usual paralysis argument in this case, but yeah.
Can you elaborate on how the million deaths would result?
Mostly from Anthropic building AIs that then kill billions of people while taking over, or their algorithmic secrets being stolen and leading to other people building AIs that then kill billions of people, or their model weights being stolen and leading to huge AI-enabled wars.
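As a sketch of what “in expectation” means here: sum over scenarios of P(scenario) times attributable deaths. The scenario list mirrors the ones just mentioned; the probabilities and death tolls are placeholder numbers made up for illustration, not anyone’s estimates:

```python
# "Deaths in expectation" = sum over scenarios of P(scenario) * attributable deaths.
# Probabilities and death tolls below are made-up placeholders for illustration only.

scenarios = {
    # name: (hypothetical probability, hypothetical deaths attributable to Anthropic)
    "anthropic_ai_takeover":      (0.001,  3e9),
    "stolen_algorithmic_secrets": (0.0005, 3e9),
    "stolen_weights_enable_wars": (0.001,  1e9),
}

expected_deaths = sum(p * deaths for p, deaths in scenarios.values())
print(f"expected deaths: {expected_deaths:.1e}")  # ~5.5e+06 with these placeholders, i.e. over a million
```

Any placeholders in this ballpark clear a million, which is all the original claim needs.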
I started out disagreeing with where I thought this comment was going, but I think I ended up reasonably sold by the end.
I want to flag something like “in any ‘normal’ circumstances, avoiding an earth-sized or even nation-sized moral catastrophe is, like, really important?” I think it… might actually be correct to do some amount of hand-wringing about that even if you know you’re ultimately going to have to make the tradeoff against it? (mostly out of a general worry about being too quick to steamroll your moral intuitions with math).
But, yeah, the circumstances aren’t normal, and it seems likely there’s at least some tradeoff here.
I am generally pleasantly surprised that AI welfare is on at least one (relatively?) senior Anthropic employee’s roadmap at all.
I wasn’t expecting it to be there at all. (Though I’m sort of surprised an Anthropic person is publicly talking about AI welfare but still not explicitly about extinction risk.)
To say the obvious thing: I think if Anthropic isn’t able to make at least somewhat-roughly-meaningful predictions about AI welfare, then their core current public research agendas have failed?